
S3 instead of HDFS with Hadoop

In this article we will discuss using S3 as a replacement for HDFS (Hadoop Distributed File System) on AWS (Amazon Web Services), and why you would want to do so. Before getting to the actual use case and the performance of S3 with Hadoop, let's understand what Hadoop and S3 are.

Let's try to understand what the exact problems are and why HDFS is not used in the cloud. When new instances are launched in the cloud to build a Hadoop cluster, they start with no data on them. One approach is to copy the entire (huge) dataset onto them, which is not feasible for several reasons, including bandwidth, the time the copy takes, and the associated cost. Second, after the jobs complete you need to copy the results back before terminating the cluster machines; otherwise the results are lost when the instances are terminated and you end up with nothing. Finally, because of the cost involved, keeping the entire cluster running just to hold the data is not feasible either. The sketch below illustrates this copy-in/copy-out overhead.
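To make that overhead concrete, here is a minimal sketch of the copy-in/copy-out workflow, assuming a hypothetical bucket my-bucket and job jar my-job.jar (S3 credentials are assumed to be set in Hadoop's site configuration):

    # Copy the input dataset from S3 into the cluster's HDFS before running anything.
    hadoop distcp s3n://my-bucket/input hdfs:///user/hadoop/input

    # Run the MapReduce job against HDFS as usual.
    hadoop jar my-job.jar MyJob hdfs:///user/hadoop/input hdfs:///user/hadoop/output

    # Copy the results back to S3 before the instances are terminated.
    hadoop distcp hdfs:///user/hadoop/output s3n://my-bucket/output

Pointing the job directly at S3, as described later, removes both distcp steps.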

Create Ubuntu AMI from Scratch on local machine

This guide explains how to create an AMI from scratch, on a local system. The main benefit of creating the AMI locally is cost saving: we do not need to launch an instance just to configure the application. Instead we can configure the OS, install and configure the required software, and then create the AMI on the local system. We then upload the newly created AMI to S3, and from that AMI we can launch instances whenever we need them, each one pre-configured. In this tutorial we will create an Ubuntu AMI from scratch. You can also follow the same procedure in the cloud (i.e. you can do this on an instance as well).

In this tutorial we will create (build the image from scratch on the local system), bundle (bundle the image), upload (push the newly created bundle to S3), and run (launch an instance based on this AMI) the AMI.
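As a rough sketch, these steps map onto the classic EC2 AMI and API tools as follows; the key pair, certificate, account id, bucket, and AMI id are all placeholders, and exact flags can differ between tool versions:

    # Bundle the loopback image file built locally into an AMI bundle.
    ec2-bundle-image -i ubuntu.img -k pk.pem -c cert.pem -u 111122223333 -d /tmp -r x86_64

    # Upload the bundle to an S3 bucket.
    ec2-upload-bundle -b my-ami-bucket -m /tmp/ubuntu.img.manifest.xml \
        -a $AWS_ACCESS_KEY_ID -s $AWS_SECRET_ACCESS_KEY

    # Register the uploaded bundle with EC2; this prints the new AMI id.
    ec2-register my-ami-bucket/ubuntu.img.manifest.xml

    # Launch an instance from the newly registered AMI.
    ec2-run-instances ami-xxxxxxxx -k my-keypair -t m1.small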

What an AMI is: An Amazon Machine Image (AMI) is a special type of virtual appliance used to instantiate (create) a virtual machine within Amazon Elastic Compute Cloud. It serves as the basic unit of deployment for services delivered using EC2. In short, an AMI is an image from which an instance can boot.

S3 as Input or Output for Hadoop MR jobs


This post shows how to use S3 (S3 native) as input/output for a Hadoop MapReduce job. We will first try to understand what S3 is and the difference between s3 and s3n, and then see how to set s3n as the input and output of a Hadoop MapReduce job. Configuring s3n as I/O can be useful for local MapReduce jobs (i.e. jobs run on a local cluster), but it is especially important for Elastic MapReduce jobs (i.e. jobs run in the cloud). When we run a job in the cloud we need to specify a storage location for input as well as output that is available for both storage and retrieval. In this tutorial we will learn how to specify S3 for input/output.
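As a minimal sketch, assuming the standard Hadoop examples jar and a hypothetical bucket my-bucket, a job can read from and write to s3n like this (the credentials can equally be set once in core-site.xml via the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties):

    # Run the bundled wordcount example with s3n URIs as input and output.
    # The -D generic options pass the S3 credentials for this job only.
    hadoop jar hadoop-examples.jar wordcount \
        -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY_ID \
        -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY \
        s3n://my-bucket/input s3n://my-bucket/output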

What is S3: Amazon S3 (Simple Storage Service) is a data storage service. Amazon S3 is storage for the Internet, designed to make web-scale computing easier for developers.
Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, and inexpensive infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize the benefits of scale and to pass those benefits on to developers. You are billed monthly for storage and data transfer; transfer between S3 and Amazon EC2 is free, which makes S3 attractive for Hadoop users who run clusters on EC2. Hadoop actually exposes two S3 filesystems: s3, the block-based filesystem, which stores files as blocks (much like HDFS) and is not readable by other S3 tools; and s3n, the native filesystem, which stores ordinary objects that other tools can read and write.
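Because Hadoop exposes s3n as an ordinary filesystem, the hadoop fs shell works against it directly; a small sketch, assuming a hypothetical bucket my-bucket and credentials already configured:

    # List the contents of the bucket through the s3n filesystem.
    hadoop fs -ls s3n://my-bucket/

    # Store a local file in the bucket and read it back.
    hadoop fs -put data.txt s3n://my-bucket/data.txt
    hadoop fs -cat s3n://my-bucket/data.txt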

Create an AMI


This post will guide you through creating an Ubuntu AMI (Amazon Machine Image) from a launched instance. In this tutorial we will create an S3-backed AMI from a running (Ubuntu) instance; a sketch of the bundling commands appears at the end of this post. Before getting down to creating the actual AMI, let's go over some basic terminology:

Understand what an AMI is: An Amazon Machine Image (AMI) is a special type of virtual appliance used to instantiate (create) a virtual machine within Amazon Elastic Compute Cloud. It serves as the basic unit of deployment for services delivered using EC2. In short, an AMI is an image from which an instance can boot.

What is Amazon EC2: Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
Create your own AMI so that you can boot new custom instances that have all the required software preinstalled. Your AMI becomes your basic unit of deployment; it saves you the time of installing the required software again and again.
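As a rough sketch, creating an S3-backed AMI from the running instance itself uses ec2-bundle-vol instead of ec2-bundle-image; the key pair, certificate, account id, and bucket below are placeholders, and flags can differ between tool versions:

    # On the running instance: bundle the root volume, excluding scratch directories.
    ec2-bundle-vol -k pk.pem -c cert.pem -u 111122223333 -d /mnt -r x86_64 -e /mnt,/tmp

    # Upload the bundle to S3, then register it to obtain the new AMI id.
    ec2-upload-bundle -b my-ami-bucket -m /mnt/image.manifest.xml \
        -a $AWS_ACCESS_KEY_ID -s $AWS_SECRET_ACCESS_KEY
    ec2-register my-ami-bucket/image.manifest.xml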