Create Ubuntu AMI from Scratch on local machine

This guide will explain about creating AMI from scratch. Here we will create AMI on local system. The main benefit of creating AMI on local system is cost saving; we do not need to launch instance for configuring application. Instead we can configure our OS, install / configure required software and then create AMI on local system. Then we can upload newly created AMI on s3. Now from this AMI we can launch instance when we need them. In this way we will get pre-configured instance. In this tutorial we will create Ubuntu AMI from scratch. You can also follow same procedure on cloud (ie you can create this on instance also)

In this tutorial we will create (create AMI from scratch on local system), bundle (bundle the image), upload (upload newly created AMI on s3), run(run the instance based on this AMI) AMI.

What AMI is: An Amazon Machine Image (AMI) is a special type of virtual appliance which is used to instantiate (create) a virtual machine within the Amazon Elastic Compute Cloud. It serves as the basic unit of deployment for services delivered using EC2. We can say that AMI is an image from which an instance can boot

S3 as Input or Output for Hadoop MR jobs


How to use s3 (s3 native) as input / output for hadoop MapReduce job. In this tutorial we will first try to understand what is s3, difference between s3 and s3n and how to set s3n as Input and output for hadoop map reduce job. Configuring s3n as I/O may be useful for local map reduce jobs (ie MR run on local cluster), But It has significant importance when we run elastic map reduce job (ie when we run job on cloud). When we run job on cloud we need to specify storage location for input as well as output, which is available for storage as well as retrieval. In this tutorial we will learn how to specify s3 for input / output.

What is S3: Amazon s3 (Simple Storage Service) is a data storage service. Amazon s3 is storage for the Internet. It is designed to make web-scale computing easier for developers.
Amazon s3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers. You are billed monthly for storage and data transfer. Transfer between s3 and Amazon EC2 is free. This makes use of s3 attractive for Hadoop users who run clusters on EC2.

Installing Greenplum SNE / CE

After understanding basics of Greenplum, let’s Setup Greenplum Community edition (CE) or single node edition (SNE). In this Greenplum tutorial we will see how to install and Initialize Greenplum Database SNE. This section provides instructions on how to install the Greenplum Database SNE software and get your single-node Greenplum Database SNE system up and running. In the following Greenplum quick start we will use CentOS 5.X or above.
Before installing we have to change following OS configuration parameters:
Set the following parameters in the /etc/sysctl.conf file and reboot:
kernel.shmmax = 500000000
kernel.shmmni = 4096
kernel.shmall = 4000000000
kernel.sem = 250 64000 100 512
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_max_syn_backlog=4096
net.core.netdev_max_backlog=10000
vm.overcommit_memory=2
Set the following parameters in the /etc/security/limits.conf file:
* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072
Add the Greenplum database Admin account:
# useradd gpadmin
# passwd gpadmin
# New password: password
# Retype new password: password
You cannot run the Greenplum Database SNE server as root. While dealing with Greenplum use this newly created user account
Installing the Greenplum Database SNE / Community edition:
1. Download or copy the Greenplum Database SNE / CE from:
www.greenplum.com/
2. Unzip the installer file:
# unzip greenplum-db-4.0.0.0-build-#-SingleNodeEdition-PLATFORM.zip
3. Launch the installer using bash:
# /bin/bash greenplum-db-4.0.0.0-build-#-SingleNodeEdition-PLATFORM.bin
4. The installer prompts you to provide an installation path. Press ENTER to accept the default install path (/usr/local/greenplum-db-4.0.0.0), or enter new path
5. The installer installs the Greenplum Database SNE /CE software and creates a greenplum-db symbolic link one directory level above your version-specific Greenplum Database SNE
6. Change the ownership of your Greenplum Database SNE installation so that it is owned by the gpadmin
# chown -R gpadmin /usr/local/greenplum-db-4.0.0.0
# chgrp -R gpadmin /usr/local/greenplum-db-4.0.0.0
7. Preparing the Data Directory Locations
Every Greenplum Database SNE instance has a designated storage area on disk that is called the data directory location.
8. Create or choose a directory that will serve as your master data storage area
On this location user data is not stored, instead metadata (data about the data) is stored. Here global system catalog resides
# mkdir /gpmaster
# chown gpadmin /gpmaster
# chgrp gpadmin /gpmaster
9. Create or choose the directories that will serve as your segment storage areas:
This is the file system location where the database data is stored.
# mkdir /gpdata1
# chown gpadmin /gpdata1
# chgrp gpadmin /gpdata1
# mkdir /gpdata2
# chown gpadmin /gpdata2
# chgrp gpadmin /gpdata2
10. Configuring Greenplum Database SNE / CE Environment Variables:
$ vi .bashrc
Then add following entry
source /usr/local/greenplum-db/greenplum_path.sh
now source it
$ source ~/.bashrc
11. Now let’s initialize Greenplum database:
Greenplum provides a utility called gpinitsystem which initializes a Greenplum Database system. After the Greenplum Database SNE system is initialized and started, you can then create and manage databases by connecting to the Greenplum master database process.
12. Log in to the system as the gpadmin user:
# su - gpadmin
13. Copy the single_hostlist example file from your Greenplum Database SNE installation to the current directory:
$ cp $GPHOME/docs/cli_help/single_hostlist_example/single_hostlist
14. Copy the gp_init_singlenode example file from your Greenplum Database SNE installation to the current directory:
$ cp $GPHOME/docs/cli_help/gp_init_singlenode_example/gp_init_singlenode
15. Edit the gp_init_singlenode file and enter your configuration settings, you can remain them default. Some default parameters in this file are:
ARRAY_NAME="GPDB SNE"
MACHINE_LIST_FILE=./single_hostlist
SEG_PREFIX=gpsne
PORT_BASE=50000
declare -a DATA_DIRECTORY=(/gpdata1 /gpdata2)
MASTER_HOSTNAME=127.0.0.1
MASTER_DIRECTORY=/gpmaster
MASTER_PORT=5432
16. Run the gpssh-exkeys utility to exchange ssh keys for the local host:
$ gpssh-exkeys -h 127.0.0.1
17. initialize Greenplum Database SNE:
$ gpinitsystem -c gp_init_singlenode
18. After the Greenplum Database SNE system is initialized and started, you can connect to the Greenplum master database process using the psql client program:
$ createdb mydb
$ psql mydb
19. Now export master data directory:
$ vi .bashrc
Then add following entry
export MASTER_DATA_DIRECTORY=/gpmaster/gpsne-1
now source it
$ source ~/.bashrc
20. Now you can perform any database operations using psql program (DDL, DML)
Uninstall Greenplum:
To uninstall run the following commands:
$ gpdeletesystem -d /gpmaster/gpsne-1
$ rm -rf /usr/local/greenplum-db-4.0.0.0
$ rm /usr/local/greenplum-db
You can remove the environment variable and restore the default setting of OS parameters(Optional)