Saturday, May 5, 2012

S3 instead of HDFS with Hadoop

In this article we will discuss using S3 as a replacement for HDFS (the Hadoop Distributed File System) when running Hadoop on AWS (Amazon Web Services), and why you would want to do that in the first place. Before getting to the actual use case and the performance of S3 with Hadoop, let's first understand the problems that motivate it.

Let's try to understand what the exact problems are and why HDFS alone is a poor fit in the cloud. When new instances are launched in the cloud to build a Hadoop cluster, they have no data on them. One approach is to copy the entire huge dataset onto them, which is not feasible for several reasons: the bandwidth it consumes, the time the copy takes, and the associated cost. Second, after the jobs complete you need to copy the results back out before terminating the cluster machines; otherwise the results are lost the moment the instances are terminated and you end up with nothing. And because of the cost involved, keeping the entire cluster running just to hold the data is not feasible either.

Since S3 is a storage service, it lets you store and accumulate data in the Amazon cloud without running any machines, and that data can then be processed with MapReduce later. MapReduce can work with any Hadoop-compatible file system, and S3 fulfills that compatibility criterion; Apache Hadoop itself ships with support for S3. For MapReduce processing, the input is read from S3 and the output is written back to S3, from where it can be retrieved at any time, even after the cloud instances that processed the data have been terminated.
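To make that concrete, here is a minimal sketch of a job driver whose input and output both live on S3. It assumes the Hadoop 1.x-era s3n:// native S3 file system; the bucket name, paths and credential values are placeholders, not anything from this article:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3JobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Credentials for the s3n:// file system; these can equally be set
    // once in core-site.xml. Both values here are placeholders.
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

    Job job = new Job(conf, "s3-example");
    job.setJarByClass(S3JobDriver.class);
    // ... set the mapper, reducer and key/value classes as usual ...

    // Input is read straight from S3 and output is written back to S3,
    // so the results outlive the cluster instances.
    FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/input"));
    FileOutputFormat.setOutputPath(job, new Path("s3n://my-bucket/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```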
BUT the point to note here is that this capability comes at the cost of data locality. In HDFS the data is stored local to the machines, while with S3 the data first has to be moved to the individual nodes of the cluster before they can process it, and after processing the results have to be stored back on S3, which is nowhere near as fast as the local file system on a machine. Because the data is actually transferred from S3 to the local nodes, this goes against the Hadoop philosophy that moving computation to the data is cheaper than moving data to the computation. A very high-speed network within the same environment does bridge this gap to a large extent, though: S3 throughput may not be that good from outside the Amazon cloud, but from within it, it should be. This is also one reason why it is a challenge to run MapReduce on other clouds: inside their own environments they have no Hadoop-compatible file system the way S3 exists in the Amazon cloud environment.

So this is using S3 as a data store only, which means the job will first load the data from S3 into the local file system of the cluster (a huge I/O effort on a large dataset), then run the map and reduce tasks, and then write the results back to S3 (another large amount of I/O over a relatively slow network).
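A rough sketch of that staging pattern is below. Again the bucket name and paths are hypothetical, and it assumes the s3n:// scheme; in practice the same copying is often done with Hadoop's built-in distcp tool instead:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class S3Staging {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem s3 = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
    FileSystem hdfs = FileSystem.get(conf); // the cluster's default FS, assumed HDFS

    // Stage the input dataset from S3 into HDFS before the job runs.
    FileUtil.copy(s3, new Path("s3n://my-bucket/input"),
                  hdfs, new Path("/data/input"), false /* keep source */, conf);

    // ... run the MapReduce job against hdfs:///data/input here ...

    // Push the results back to S3 before the cluster is terminated.
    FileUtil.copy(hdfs, new Path("/data/output"),
                  s3, new Path("s3n://my-bucket/output"), false, conf);
  }
}
```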

For configuration of S3 and Hadoop, please click here
