Showing posts with label Cloud Storage. Show all posts

Big Data Use-cases Across Industries, It's Becoming Bigger and Bigger

Big Data keeps getting bigger: according to certain market forecasts, it will grow by 1211%.
Big Data is growing very fast as the volume of data explodes. According to Gartner, we create 2.5 quintillion bytes of data per day, and 90% of the data in the world has been created in the last two years alone.
An interesting infographic from Wikibon.org


Image source: wikibon.org/

HBase at Facebook

After understanding the basics of HBase, let’s look at how Facebook uses it. I found a very good tutorial from Facebook on how they use HBase for messaging. It covers an introduction to HBase, why HBase, and the MySQL-to-HBase migration at Facebook.


 

Optimize Map Reduce Job Performance

Optimize Hadoop performance: to improve Hadoop performance, you need to change various configuration parameters in core-site.xml, hdfs-site.xml, and mapred-site.xml. Which parameters to tune depends on the type of processing; it varies from case to case, and there is no hard and fast rule.

To install Hadoop on an Ubuntu cluster, you can refer to this post.

We can change the block size, the number of mappers and reducers, the sort factor, JVM reuse, and the memory for the Java process; we can also enable compression (including map output compression), use a combiner, etc.
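To make the list above a little more concrete, here is a minimal Python sketch that renders a few of these settings as a Hadoop-style `<configuration>` document. The property names are the classic Hadoop 0.20 / CDH3 ones; the values and the helper name `build_site_xml` are only illustrative assumptions, not recommendations, and the right numbers depend on your jobs and hardware.

```python
import xml.etree.ElementTree as ET

# Illustrative values only (assumptions); tune them for your own workload.
MAPRED_TUNING = {
    "io.sort.factor": 50,                  # streams merged at once while sorting
    "io.sort.mb": 200,                     # sort buffer size in MB
    "mapred.job.reuse.jvm.num.tasks": -1,  # -1 = reuse a JVM for unlimited tasks
    "mapred.child.java.opts": "-Xmx512m",  # heap for each task's Java process
    "mapred.compress.map.output": "true",  # compress intermediate map output
    "mapred.reduce.tasks": 4,              # number of reducers per job
}


def build_site_xml(properties):
    """Render a name/value mapping as a Hadoop <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


mapred_site_xml = build_site_xml(MAPRED_TUNING)
```

The resulting string could be written to mapred-site.xml. The block size is not included here because it is an HDFS setting (`dfs.block.size`) that belongs in hdfs-site.xml.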
I found a very nice description given by Cloudera



S3 instead of HDFS with Hadoop

In this article we will discuss using S3 as a replacement for HDFS (Hadoop Distributed File System) on AWS (Amazon Web Services), and why S3 is needed. Before coming to the original use case and the performance of S3 with Hadoop, let’s understand what Hadoop is and what S3 is.

Let’s try to understand what the exact problems are and why HDFS is not used in the cloud. When new instances are launched in the cloud to build a Hadoop cluster, they do not have any data associated with them. One approach is to copy the entire huge dataset onto them, which is not feasible for various reasons, including bandwidth, time to copy, and the associated cost. Secondly, after the jobs complete you need to copy the results back before terminating the cluster machines; otherwise the results are lost when the instances are terminated. Finally, because of the associated cost, keeping the entire cluster running just to hold the data is not feasible.
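A rough back-of-the-envelope calculation shows why copying the data in and out is painful. The sketch below is only arithmetic; the dataset size, result size, and link speed are assumed numbers chosen for illustration, and real transfers are usually slower than the raw link speed suggests.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_s):
    """Ideal time to move size_bytes over a link of the given speed."""
    return size_bytes * 8 / bandwidth_bits_per_s


def round_trip_hours(input_bytes, output_bytes, bandwidth_bits_per_s):
    """Hours spent copying the input to the cluster and the results back."""
    total = (transfer_seconds(input_bytes, bandwidth_bits_per_s)
             + transfer_seconds(output_bytes, bandwidth_bits_per_s))
    return total / 3600


TB = 10 ** 12    # bytes in a terabyte (decimal)
GBIT = 10 ** 9   # bits per second in 1 Gbit/s

# Assumed example: 10 TB of input, 1 TB of results, a 1 Gbit/s link.
copy_in = transfer_seconds(10 * TB, GBIT)   # 80000 seconds, roughly 22 hours
total_hours = round_trip_hours(10 * TB, 1 * TB, GBIT)
```

During all of those hours the cluster instances are running and billed without doing any processing, which is the cost the paragraph above refers to.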

Deploy Hadoop Cluster

Step-by-step tutorial to deploy a Hadoop cluster (fully distributed mode):
Setting up Hadoop as a distributed cluster requires multiple machines/nodes; one node acts as the master and all the rest act as slaves.
If you want a quick introduction to Hadoop, please click here.
If you want to set up Hadoop in pseudo-distributed mode, please click here.

In this tutorial:
  • I am using 3 nodes: 1 master and 2 slaves
  • I am using the Cloudera distribution for Apache Hadoop, CDH3U3 (you can also use Apache Hadoop 0.20.X)
  • I am deploying Hadoop on Ubuntu (you can use another OS such as CentOS, Red Hat, etc.)

Install / Setup Hadoop on cluster

Install Hadoop on master:

1. Add entries for the master and slaves to the hosts file:
Edit the hosts file and add the following entries
$ sudo pico /etc/hosts
MASTER-IP    master
SLAVE01-IP   slave01
SLAVE02-IP   slave02
(In place of MASTER-IP, SLAVE01-IP, and SLAVE02-IP, put the corresponding IP addresses.)
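If you have more than a few nodes, typing these lines by hand on every machine gets error-prone. As a minimal sketch, the Python helper below (the function name and the 10.0.0.x addresses are hypothetical, not from my cluster) validates each IP and prints the lines to paste into /etc/hosts; it does not touch the file itself.

```python
import ipaddress
from typing import Dict


def render_hosts_entries(nodes: Dict[str, str]) -> str:
    """Return /etc/hosts lines for a hostname -> IP mapping.

    Raises ValueError if any address is not a valid IP.
    """
    lines = []
    for hostname, ip in nodes.items():
        ipaddress.ip_address(ip)  # validation only; raises ValueError if bad
        lines.append(f"{ip:<16}{hostname}")
    return "\n".join(lines)


# Hypothetical addresses; replace them with the real IPs of your nodes.
cluster = {
    "master": "10.0.0.1",
    "slave01": "10.0.0.2",
    "slave02": "10.0.0.3",
}
hosts_block = render_hosts_entries(cluster)
print(hosts_block)
```

The same block should be added on the master and on every slave so that all nodes resolve each other by name.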

Create Ubuntu AMI from Scratch on local machine

This guide explains how to create an AMI from scratch on a local system. The main benefit of creating the AMI locally is cost saving: we do not need to launch an instance just to configure the application. Instead, we can configure our OS, install and configure the required software, and then create the AMI on the local system. Then we can upload the newly created AMI to S3 and launch instances from it whenever we need them, so every instance comes pre-configured. In this tutorial we will create an Ubuntu AMI from scratch. You can also follow the same procedure in the cloud (i.e. you can create the AMI on an instance as well).

In this tutorial we will go through four steps: create the AMI from scratch on the local system, bundle the image, upload the newly created AMI to S3, and run an instance based on this AMI.

What AMI is: An Amazon Machine Image (AMI) is a special type of virtual appliance which is used to instantiate (create) a virtual machine within the Amazon Elastic Compute Cloud. It serves as the basic unit of deployment for services delivered using EC2. We can say that an AMI is an image from which an instance can boot.

Understanding Greenplum

This blog will help you understand Greenplum: what it is, its architecture, its different segments, and its basics in detail. In this Greenplum tutorial we will try to understand the capabilities and the architecture provided by Greenplum.
 
What is Greenplum: Greenplum Database is a massively parallel processing (MPP) database server based on PostgreSQL open-source technology. MPP (also known as a shared nothing architecture) refers to systems with two or more processors which cooperate to carry out an operation - each processor with its own memory, operating system and disks.
Source: Greenplum
 
Greenplum Database Architecture: Greenplum Database utilizes a shared-nothing MPP (massively parallel processing) architecture. In this architecture, data is automatically partitioned across multiple 'segment' servers, and each 'segment' owns and manages a distinct portion of the overall data. All communication is via a network interconnect; there is no disk-level sharing or contention to be concerned with (i.e. it is a 'shared-nothing' architecture). The segment servers are able to process every query in a fully parallel manner, use all disk connections simultaneously, and efficiently flow data between segments as query plans dictate.
Source: Greenplum
 
 
Is Greenplum free: Greenplum is not free for production use or for setting up a cluster. However, it has launched a Community Edition which is free, subject to pre-specified guidelines.
 
What is Community edition: The EMC Greenplum Community Edition (CE) provides a powerful and comprehensive analytic environment enabling users to turn increasingly large amounts of data into useful insight. Developers, data scientists and other data professionals can experiment with real-world data, perform advanced analytics and most importantly - rapidly reveal insights from big data sets with ease
 
Parallel Query Optimizer: "Greenplum Database's parallel query optimizer is responsible for converting SQL or MapReduce into a physical execution plan." It does this using a cost-based optimization algorithm in which it evaluates a vast number of potential plans and selects the one that it believes will lead to the most efficient query execution.
 
Parallel Dataflow Engine:  At the heart of the Greenplum Database is the Parallel Dataflow Engine. This is where the real work of processing and analyzing data is done. Greenplum’s Parallel Dataflow Engine is highly optimized at executing both SQL and MapReduce, and does so in a massively parallel manner
 Source: Greenplum
 
 
Greenplum Database support (hardware requirements): Greenplum Database is supported for production use on SUSE Linux Enterprise Server 10.2 (64-bit), Red Hat Enterprise Linux 5.x (64-bit), CentOS Linux 5.x (64-bit) and Sun Solaris 10U5+ (64-bit). Greenplum Database 3.3 is supported on server hardware from a range of vendors including HP, Dell, Sun and IBM. Greenplum Database is supported for non-production (development and evaluation) use on Mac OSX 10.5, Red Hat Enterprise Linux 5.2 or higher (32-bit) and CentOS Linux 5.2 or higher (32-bit).
 
Greenplum Master Segment: "The master is the entry point to the Greenplum Database system. It is the database process that accepts client connections and processes the SQL commands issued by the users of the system". The master is where the global system catalog resides (the set of system tables that contain metadata about the Greenplum Database system itself), however the master does not contain any user data. Data resides only on the segments. The master does the work of authenticating client connections, processing the incoming SQL commands, distributing the work load between the segments, coordinating the results returned by each of the segments, and presenting the final results to the client program
 
Greenplum Primary Segments: In Greenplum Database, the segments are where the data is stored and where the majority of query processing takes place. User-defined tables and their indexes are distributed across the available number of segments in the Greenplum Database system, each segment containing a distinct portion of the data. Segment instances are the database server processes that serve segments. Users do not interact directly with the segments in a Greenplum Database system, but do so through the master.
 
Mirror Segment: Mirror segments allow database queries to fail over to a backup segment if the primary segment becomes unavailable. A mirror (secondary) segment always resides on a different host than its primary.
 
Source: Greenplum
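To picture how rows end up on segments and where mirrors live, here is a small Python model. It is only a sketch under stated assumptions: it uses hash distribution on a key column, CRC32 is a stand-in and not Greenplum's real hash function, and the host names (sdw1, sdw2) and function names are hypothetical.

```python
import zlib
from collections import defaultdict


def segment_for(key, num_segments):
    """Pick a segment by hashing the distribution key (CRC32 is a stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments):
    """Group rows by the segment that would own them."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return dict(segments)


def place_mirrors(primary_hosts):
    """Put each mirror on the next host in the ring, never on its primary's host."""
    hosts = sorted(set(primary_hosts.values()))
    if len(hosts) < 2:
        raise ValueError("mirroring needs at least two hosts")
    return {
        seg: hosts[(hosts.index(host) + 1) % len(hosts)]
        for seg, host in primary_hosts.items()
    }


def master_count(segments):
    """Master-style gather: each segment counts its rows, the master sums them."""
    return sum(len(rows) for rows in segments.values())


rows = [{"id": i, "name": f"user{i}"} for i in range(100)]
by_segment = distribute(rows, "id", num_segments=4)
primaries = {0: "sdw1", 1: "sdw1", 2: "sdw2", 3: "sdw2"}
mirrors = place_mirrors(primaries)
```

Each row lands on exactly one segment, the same key always maps to the same segment, and the master only combines the partial results; it holds no user data itself.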
 
 

Create an AMI


This blog will guide you through creating an Ubuntu AMI (Amazon Machine Image) from a launched instance. In this tutorial we will create an S3-backed AMI from a running Ubuntu instance. Before getting down to creating the actual AMI, let’s try to understand some basic terminology:

Understand what AMI is: An Amazon Machine Image (AMI) is a special type of virtual appliance which is used to instantiate (create) a virtual machine within the Amazon Elastic Compute Cloud. It serves as the basic unit of deployment for services delivered using EC2. We can say that AMI is an image from which an instance can boot.

What is Amazon EC2: Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
Create your own AMI so that you can boot new custom instances which have all the required software preinstalled. Your AMI becomes the basic unit of deployment; it saves you the time of installing the required software again and again.

Hue Installation and Configuration

This section describes how to install Cloudera Hue and change its default configuration, for example configuring another database for Hue and sending notifications/emails on job completion.


Installing Hue on one machine with CDH in pseudo-distributed mode:


To Install Hue:
  • Hue can be installed with this single command:

$ sudo apt-get install hue