Thursday, May 5, 2011

Create Ubuntu AMI from Scratch on local machine

This guide explains how to create an AMI from scratch on a local system. The main benefit of creating the AMI locally is cost saving: we do not need to launch an instance just to configure the application. Instead we can configure the OS, install and configure the required software, and then create the AMI on the local system. We then upload the newly created AMI to S3 and launch instances from it whenever we need them, getting a pre-configured instance every time. In this tutorial we will create an Ubuntu AMI from scratch. You can also follow the same procedure in the cloud (i.e. you can do this on a running instance as well).

In this tutorial we will create the AMI from scratch on the local system, bundle the image, upload the newly created AMI to S3, and run an instance based on this AMI.

What an AMI is: An Amazon Machine Image (AMI) is a special type of virtual appliance used to instantiate (create) a virtual machine within the Amazon Elastic Compute Cloud. It serves as the basic unit of deployment for services delivered using EC2. In short, an AMI is an image from which an instance can boot.
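
To make the workflow above concrete, here is a rough sketch of the local part. It assumes the EC2 AMI tools are installed and that your X.509 certificate (cert.pem), private key (pk.pem), AWS account number, access keys and an S3 bucket are at hand; the file names, image size, architecture and Ubuntu release below are placeholders you should adapt:
# dd if=/dev/zero of=ubuntu.img bs=1M count=1024
# mkfs.ext3 -F ubuntu.img
# mkdir -p /mnt/ubuntu
# mount -o loop ubuntu.img /mnt/ubuntu
# debootstrap --arch i386 lucid /mnt/ubuntu http://archive.ubuntu.com/ubuntu
(configure the image under /mnt/ubuntu: /etc/fstab, networking, kernel modules, your own software)
# umount /mnt/ubuntu
# ec2-bundle-image -i ubuntu.img -k pk.pem -c cert.pem -u <account-number> -r i386 -d /tmp
# ec2-upload-bundle -b <your-bucket> -m /tmp/ubuntu.img.manifest.xml -a <access-key> -s <secret-key>
# ec2-register <your-bucket>/ubuntu.img.manifest.xml -n my-ubuntu-ami
ec2-register returns the new AMI ID, which you can then use to launch instances.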

S3 as Input or Output for Hadoop MR jobs


This tutorial covers how to use S3 (S3 native) as input/output for a Hadoop MapReduce job. We will first look at what S3 is and the difference between s3 and s3n, and then see how to set s3n as the input and output for a Hadoop MapReduce job. Configuring s3n as I/O can be useful for local MapReduce jobs (i.e. jobs run on a local cluster), but it is particularly important for Elastic MapReduce jobs (i.e. jobs run in the cloud). When we run a job in the cloud we need to specify a storage location for input as well as output that is available for both storage and retrieval. In this tutorial we will learn how to specify S3 for input/output.

What is S3: Amazon s3 (Simple Storage Service) is a data storage service. Amazon s3 is storage for the Internet. It is designed to make web-scale computing easier for developers.
Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of websites. The service aims to maximize the benefits of scale and to pass those benefits on to developers. You are billed monthly for storage and data transfer; transfer between S3 and Amazon EC2 is free. This makes S3 attractive for Hadoop users who run clusters on EC2.
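
As a quick preview of where we are heading: the AWS credentials go into the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties (either in core-site.xml or on the command line), and the input/output paths are given as s3n:// URIs. A minimal sketch, assuming the stock word-count example jar and a bucket of your own (the keys and bucket name below are placeholders; the output directory must not already exist):
$ hadoop jar hadoop-*-examples.jar wordcount \
    -D fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY \
    -D fs.s3n.awsSecretAccessKey=YOUR_SECRET_KEY \
    s3n://your-bucket/input s3n://your-bucket/output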

Wednesday, April 13, 2011

Installing Greenplum SNE / CE

After understanding the basics of Greenplum, let's set up the Greenplum Community Edition (CE) / Single Node Edition (SNE). In this Greenplum tutorial we will see how to install and initialize the Greenplum Database SNE. This section provides instructions on how to install the Greenplum Database SNE software and get your single-node Greenplum Database SNE system up and running. In the following quick start we will use CentOS 5.x or above.
Before installing, we have to change the following OS configuration parameters:
Set the following parameters in the /etc/sysctl.conf file and reboot:
kernel.shmmax = 500000000
kernel.shmmni = 4096
kernel.shmall = 4000000000
kernel.sem = 250 64000 100 512
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_max_syn_backlog=4096
net.core.netdev_max_backlog=10000
vm.overcommit_memory=2
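If you would rather not reboot right away, the same values can usually be loaded into the running kernel with:
# sysctl -p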
Set the following parameters in the /etc/security/limits.conf file:
* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072
Add the Greenplum database Admin account:
# useradd gpadmin
# passwd gpadmin
New password: password
Retype new password: password
You cannot run the Greenplum Database SNE server as root; use this newly created gpadmin account whenever you work with Greenplum.
Installing the Greenplum Database SNE / Community edition:
1. Download or copy the Greenplum Database SNE / CE from:
www.greenplum.com/
2. Unzip the installer file:
# unzip greenplum-db-4.0.0.0-build-#-SingleNodeEdition-PLATFORM.zip
3. Launch the installer using bash:
# /bin/bash greenplum-db-4.0.0.0-build-#-SingleNodeEdition-PLATFORM.bin
4. The installer prompts you to provide an installation path. Press ENTER to accept the default install path (/usr/local/greenplum-db-4.0.0.0), or enter a new path.
5. The installer installs the Greenplum Database SNE / CE software and creates a greenplum-db symbolic link one directory level above your version-specific Greenplum Database SNE installation.
6. Change the ownership of your Greenplum Database SNE installation so that it is owned by the gpadmin user:
# chown -R gpadmin /usr/local/greenplum-db-4.0.0.0
# chgrp -R gpadmin /usr/local/greenplum-db-4.0.0.0
7. Preparing the Data Directory Locations
Every Greenplum Database SNE instance has a designated storage area on disk that is called the data directory location.
8. Create or choose a directory that will serve as your master data storage area.
User data is not stored in this location; it holds only metadata (data about the data), and this is where the global system catalog resides.
# mkdir /gpmaster
# chown gpadmin /gpmaster
# chgrp gpadmin /gpmaster
9. Create or choose the directories that will serve as your segment storage areas:
This is the file system location where the database data is stored.
# mkdir /gpdata1
# chown gpadmin /gpdata1
# chgrp gpadmin /gpdata1
# mkdir /gpdata2
# chown gpadmin /gpdata2
# chgrp gpadmin /gpdata2
10. Configuring Greenplum Database SNE / CE Environment Variables:
$ vi .bashrc
Then add the following entry:
source /usr/local/greenplum-db/greenplum_path.sh
Now source it:
$ source ~/.bashrc
11. Now let’s initialize Greenplum database:
Greenplum provides a utility called gpinitsystem which initializes a Greenplum Database system. After the Greenplum Database SNE system is initialized and started, you can then create and manage databases by connecting to the Greenplum master database process.
12. Log in to the system as the gpadmin user:
# su - gpadmin
13. Copy the single_hostlist example file from your Greenplum Database SNE installation to the current directory:
$ cp $GPHOME/docs/cli_help/single_hostlist_example/single_hostlist .
14. Copy the gp_init_singlenode example file from your Greenplum Database SNE installation to the current directory:
$ cp $GPHOME/docs/cli_help/gp_init_singlenode_example/gp_init_singlenode .
15. Edit the gp_init_singlenode file and enter your configuration settings; you can leave them at their defaults. Some default parameters in this file are:
ARRAY_NAME="GPDB SNE"
MACHINE_LIST_FILE=./single_hostlist
SEG_PREFIX=gpsne
PORT_BASE=50000
declare -a DATA_DIRECTORY=(/gpdata1 /gpdata2)
MASTER_HOSTNAME=127.0.0.1
MASTER_DIRECTORY=/gpmaster
MASTER_PORT=5432
16. Run the gpssh-exkeys utility to exchange ssh keys for the local host:
$ gpssh-exkeys -h 127.0.0.1
17. Initialize the Greenplum Database SNE:
$ gpinitsystem -c gp_init_singlenode
18. After the Greenplum Database SNE system is initialized and started, you can connect to the Greenplum master database process using the psql client program:
$ createdb mydb
$ psql mydb
19. Now export master data directory:
$ vi .bashrc
Then add the following entry:
export MASTER_DATA_DIRECTORY=/gpmaster/gpsne-1
Now source it:
$ source ~/.bashrc
20. Now you can perform any database operation (DDL, DML) using the psql program.
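As a tiny illustrative session (the table name and data below are made up; DISTRIBUTED BY tells Greenplum which column to hash rows across segments on):
$ psql -d mydb -c "CREATE TABLE test_table (id int, name text) DISTRIBUTED BY (id);"
$ psql -d mydb -c "INSERT INTO test_table VALUES (1, 'hello'), (2, 'greenplum');"
$ psql -d mydb -c "SELECT * FROM test_table ORDER BY id;"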
Uninstall Greenplum:
To uninstall run the following commands:
$ gpdeletesystem -d /gpmaster/gpsne-1
$ rm -rf /usr/local/greenplum-db-4.0.0.0
$ rm /usr/local/greenplum-db
Optionally, you can also remove the environment variables and restore the default OS parameter settings.

Tuesday, April 5, 2011

Understanding Greenplum

This post will help you understand Greenplum: what it is, its architecture, its different segments, and its basics in detail. In this Greenplum tutorial we will try to understand the capabilities and the architecture provided by Greenplum.
 
What is Greenplum: Greenplum Database is a massively parallel processing (MPP) database server based on PostgreSQL open-source technology. MPP (also known as a shared nothing architecture) refers to systems with two or more processors which cooperate to carry out an operation - each processor with its own memory, operating system and disks.
Source: Greenplum
 
Greenplum Database Architecture: Greenplum Database utilizes a shared-nothing MPP (massively parallel processing) architecture. In this architecture, data is automatically partitioned across multiple 'segment' servers, and each 'segment' owns and manages a distinct portion of the overall data. All communication is via a network interconnect -- there is no disk-level sharing or contention to be concerned with (i.e. it is a 'shared-nothing' architecture). The segment servers are able to process every query in a fully parallel manner, use all disk connections simultaneously, and efficiently flow data between segments as query plans dictate.
Source: Greenplum
 
 
Is Greenplum free: Greenplum is not free to set up for production or to set up a cluster, but Greenplum has launched a Community Edition which is free, subject to pre-specified guidelines.
 
What is Community edition: The EMC Greenplum Community Edition (CE) provides a powerful and comprehensive analytic environment enabling users to turn increasingly large amounts of data into useful insight. Developers, data scientists and other data professionals can experiment with real-world data, perform advanced analytics and most importantly - rapidly reveal insights from big data sets with ease
 
Parallel Query Optimizer: "Greenplum Database's parallel query optimizer is responsible for converting SQL or MapReduce into a physical execution plan." It does this using a cost-based optimization algorithm in which it evaluates a vast number of potential plans and selects the one that it believes will lead to the most efficient query execution.
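
You can peek at the plan the optimizer picks for any query with EXPLAIN; the Motion nodes in the output show how data moves between segments. A minimal illustration, assuming a database mydb and a table such as the test_table created in the installation post above:
$ psql -d mydb -c "EXPLAIN SELECT count(*) FROM test_table;"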
 
Parallel Dataflow Engine:  At the heart of the Greenplum Database is the Parallel Dataflow Engine. This is where the real work of processing and analyzing data is done. Greenplum’s Parallel Dataflow Engine is highly optimized at executing both SQL and MapReduce, and does so in a massively parallel manner
 Source: Greenplum
 
 
Greenplum Database support (hardware requirements): Greenplum Database is supported for production use on SUSE Linux Enterprise Server 10.2 (64-bit), Red Hat Enterprise Linux 5.x (64-bit), CentOS Linux 5.x (64-bit) and Sun Solaris 10U5+ (64-bit). Greenplum Database 3.3 is supported on server hardware from a range of vendors including HP, Dell, Sun and IBM. Greenplum Database is supported for non-production (development and evaluation) use on Mac OS X 10.5, Red Hat Enterprise Linux 5.2 or higher (32-bit) and CentOS Linux 5.2 or higher (32-bit).
 
Greenplum Master Segment: "The master is the entry point to the Greenplum Database system. It is the database process that accepts client connections and processes the SQL commands issued by the users of the system". The master is where the global system catalog resides (the set of system tables that contain metadata about the Greenplum Database system itself), however the master does not contain any user data. Data resides only on the segments. The master does the work of authenticating client connections, processing the incoming SQL commands, distributing the work load between the segments, coordinating the results returned by each of the segments, and presenting the final results to the client program
 
Greenplum Primary Segments: In Greenplum Database, the segments are where the data is stored and where the majority of query processing takes place. User-defined tables and their indexes are distributed across the available number of segments in the Greenplum Database system, each segment containing a distinct portion of the data. Segment instances are the database server processes that serve segments. Users do not interact directly with the segments in a Greenplum Database system, but do so through the master.
 
Mirror Segment: Mirror segments allow database queries to fail over to a backup segment if the primary segment becomes unavailable. A mirror (secondary) segment always resides on a different host than its primary.
 
Source: Greenplum
 
 

Monday, March 28, 2011

Create an AMI


This blog will guide you through creating an Ubuntu AMI (Amazon Machine Image) from a launched instance. In this tutorial we will create an S3-backed AMI from a running Ubuntu instance. Before getting down to creating the actual AMI, let's go over some basic terminology:

Understand what an AMI is: An Amazon Machine Image (AMI) is a special type of virtual appliance used to instantiate (create) a virtual machine within the Amazon Elastic Compute Cloud. It serves as the basic unit of deployment for services delivered using EC2. In short, an AMI is an image from which an instance can boot.

What is Amazon EC2: Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
Create your own AMI so that you can boot new custom instances that have all the required software preinstalled. Your AMI becomes the basic unit of deployment; it will save you the time of installing the required software again and again.
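
In outline, the S3-backed flow from inside the running Ubuntu instance looks like the sketch below. It assumes the EC2 AMI and API tools are installed on the instance; the key files, account number, access keys, bucket and AMI name are placeholders:
$ sudo ec2-bundle-vol -d /mnt -k pk.pem -c cert.pem -u <account-number> -r i386
$ ec2-upload-bundle -b <your-bucket> -m /mnt/image.manifest.xml -a <access-key> -s <secret-key>
$ ec2-register <your-bucket>/image.manifest.xml -n my-ubuntu-ami
ec2-register returns the new AMI ID, which you can then use to launch pre-configured instances.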

Wednesday, March 16, 2011

Hue Features

This Hue tutorial describes the features of Cloudera Hue.

Cloudera Hue is a web-based UI for Hadoop. It started out as Cloudera Desktop and was later renamed Hue. Hue provides the following features:
  • Beeswax
  • File Browser
  • Job Designer
  • Job Browser
  • User Admin

Hue Installation and Configuration

This section provides instructions for installing Cloudera Hue and changing its default configuration, for example configuring a different database for Hue and sending a notification email on job completion.


Installing Hue on one machine with CDH in pseudo-distributed mode:


To Install Hue:
  • Hue can be installed with a single command:

$ sudo apt-get install hue
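
After the package is installed you can start the Hue service and point a browser at it. As a quick sanity check (not the full configuration described above), and assuming the CDH3 default port of 8088:
$ sudo /etc/init.d/hue start
Then open http://localhost:8088/ in a browser; typically the first account created at login becomes the Hue administrator.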

Friday, March 4, 2011

Running Cloudera in Distributed Mode

This section contains instructions for installing the Cloudera Distribution for Hadoop (CDH3) on Ubuntu. It is a CDH quick-start tutorial for setting up CDH3 on Debian-based systems. This is the shortest Cloudera installation tutorial; here you will find all the commands, and their descriptions, required to install Cloudera in distributed mode (multi-node cluster).

Prerequisite: Before starting Cloudera in distributed mode you must set up Cloudera in pseudo-distributed mode, and you need at least two machines, one for the master and another for the slave (you can also create more than one virtual machine on a single physical machine to form the cluster).
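
As a rough preview of what the distributed setup involves (the package names are the CDH3 ones; the hostname master is a placeholder), each node gets only the daemons for its role:
On the master:
$ sudo apt-get install hadoop-0.20-namenode hadoop-0.20-jobtracker
On each slave:
$ sudo apt-get install hadoop-0.20-datanode hadoop-0.20-tasktracker
Then, on every node, point the configuration under /etc/hadoop/conf at the master (e.g. fs.default.name = hdfs://master:8020 in core-site.xml and mapred.job.tracker = master:8021 in mapred-site.xml) before starting the daemons.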


Running Cloudera in Pseudo Distributed Mode

This section contains instructions for installing the Cloudera Distribution for Hadoop (CDH3) on Ubuntu. It is a CDH quick-start tutorial for setting up CDH3 on Debian-based systems. This is the shortest Cloudera installation tutorial; here you will find all the commands, and their descriptions, required to install Cloudera in pseudo-distributed mode (single-node cluster).

Following steps tested on
Hadoop: CDH (Cloudera Distribution of Apache Hadoop)
OS: Ubuntu
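
The heart of that setup, assuming the Cloudera CDH3 apt repository has already been added (see the standalone post below), is a single meta-package that installs all the daemons with a ready-made pseudo-distributed configuration:
$ sudo apt-get update
$ sudo apt-get install hadoop-0.20-conf-pseudo
$ for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done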

Running Cloudera in Standalone Mode

This section contains instructions for installing the Cloudera Distribution for Hadoop (CDH3) on Ubuntu. It is a CDH quick-start tutorial for setting up CDH3 on Debian-based systems. This is the shortest Cloudera installation tutorial; here you will find all the commands, and their descriptions, required to install Cloudera in standalone mode (single-node cluster).

Following steps tested on
Hadoop: CDH (Cloudera Distribution of Apache Hadoop)
OS: Ubuntu
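
A condensed sketch of those steps, assuming Ubuntu Lucid (swap the release codename for your own); the repository line and key URL are the standard CDH3 ones:
$ echo "deb http://archive.cloudera.com/debian lucid-cdh3 contrib" | sudo tee /etc/apt/sources.list.d/cloudera.list
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install hadoop-0.20
$ hadoop version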

Upgrade Hadoop

This section contains instructions to upgrade Hadoop from one version to another. This tutorial explains how to upgrade Hadoop running in distributed mode (i.e. on a cluster) without loss of data. Before starting the upgrade procedure, please ensure that no jobs are running.

1. bin/stop-mapred.sh
   Stop the MapReduce cluster and all client applications running on the DFS cluster.
2. bin/stop-dfs.sh
   Stop DFS using the shutdown command.
3. Install the new version of Hadoop on all the nodes in the cluster.
4. bin/start-dfs.sh -upgrade
   Start the DFS cluster with the -upgrade option.
5. bin/start-mapred.sh
   Start the MapReduce cluster.
6. bin/hadoop dfsadmin -finalizeUpgrade
   Verify that the components run properly, then finalize the upgrade once you are satisfied.
If you get any error, visit Hadoop Troubleshooting.


Hadoop in Distributed Mode

This section contains instructions for Hadoop installation on Ubuntu. It is a Hadoop quick-start tutorial for setting up Hadoop quickly. This is the shortest Hadoop installation tutorial; here you will find all the commands, and their descriptions, required to install Hadoop in distributed mode (multi-node cluster).

Prerequisite: Before starting Hadoop in distributed mode you must set up Hadoop in pseudo-distributed mode, and you need at least two machines, one for the master and another for the slave (you can also create more than one virtual machine on a single physical machine).




Following steps tested on:
OS: ubuntu
Hadoop: Apache Hadoop 0.20.X
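
In outline, the extra wiring on top of the pseudo-distributed setup looks like this (the hostnames master and slave1 are placeholders; passwordless ssh from the master to each slave is assumed; run the commands from the Hadoop directory on the master):
$ echo slave1 >> conf/slaves
(on every node, set fs.default.name to hdfs://master:9000 in conf/core-site.xml and mapred.job.tracker to master:9001 in conf/mapred-site.xml)
$ bin/hadoop namenode -format
$ bin/start-dfs.sh
$ bin/start-mapred.sh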

Hadoop in Pseudo Distributed Mode

After running Hadoop in Standalone mode, let's deploy Hadoop on a single machine:


This section contains instructions for Hadoop installation on Ubuntu. It is a Hadoop quick-start tutorial for setting up Hadoop quickly. This is the shortest Hadoop installation tutorial; here you will find all the commands, and their descriptions, required to install Hadoop in pseudo-distributed mode (single-node cluster). In this tutorial I will describe the steps required to deploy Hadoop. The main goal is to get a "simple" Hadoop installation up and running so that you can play around with the software and learn more about it.


This Tutorial has been tested on:
  • Ubuntu Linux (10.04 LTS)
  • Hadoop 0.20.2
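
To give a flavour of where the full steps end up: conf/core-site.xml gets fs.default.name = hdfs://localhost:9000, conf/hdfs-site.xml gets dfs.replication = 1, conf/mapred-site.xml gets mapred.job.tracker = localhost:9001, and passwordless ssh to localhost is set up. After that, from the Hadoop directory:
$ bin/hadoop namenode -format
$ bin/start-all.sh
$ jps
$ bin/hadoop fs -ls /
jps should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker if everything came up.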

Hadoop in Standalone Mode

After understanding what Hadoop is, let's deploy Hadoop on a single machine:


This section contains instructions for Hadoop installation on Ubuntu. It is a Hadoop quick-start tutorial for setting up Hadoop quickly. This is the shortest step-by-step Hadoop installation tutorial; here you will find all the commands, and their descriptions, required to install Hadoop in standalone mode (single-node cluster). In this tutorial I will describe the steps required to deploy Hadoop. The main goal is to get a "simple" Hadoop installation up and running so that you can play around with the software and learn more about it.


This Tutorial has been tested on:
  • Ubuntu Linux (10.04 LTS)
  • Hadoop 0.20.2
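
In standalone mode there are no daemons to start; once Java is installed and the tarball unpacked, a job runs as a single process against the local filesystem. The classic smoke test from the Hadoop quick start, run from the unpacked Hadoop directory:
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*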

Understanding What is Hadoop


What is Hadoop:
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware; it incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system that, like Hadoop itself, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets (in the range of terabytes to zettabytes).