In this article we will discuss using S3 (Amazon Simple Storage Service) as a replacement for HDFS (Hadoop Distributed File System) on AWS (Amazon Web Services), and why S3 is needed at all. Before coming to the actual use case and the performance of S3 with Hadoop, let us first understand what Hadoop is and what S3 is.
Let's try to understand what the exact problems are and why HDFS is not used in the cloud. When new instances are launched in the cloud to build a Hadoop cluster, they do not have any data on them. One approach is to copy the entire, typically huge, dataset onto them, which is not feasible for several reasons, including bandwidth, the time needed for the copy and the associated cost. Secondly, after the jobs complete you have to copy the results back before terminating the cluster machines; otherwise the results are lost the moment the instances are terminated and you get nothing. Finally, because of the cost involved, keeping the whole cluster running just to collect data is not feasible either.
Since S3 is a storage service, it lets you store and accumulate data on the Amazon cloud without running any machines, and that data can then be processed with MapReduce later. MapReduce can work with any Hadoop-compatible file system, and S3 satisfies that compatibility criterion; Apache Hadoop itself provides support for using S3. For MapReduce processing, the input is read from S3 and the output is written back to S3, from where it can be retrieved any time later, even after the cloud instances that processed the data have been terminated. A job only needs to point its input and output paths at S3, as sketched below.
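As a rough illustration only (none of this code appears in the original post), here is a minimal WordCount driver that reads its input from S3 and writes its output back to S3. It uses the built-in TokenCounterMapper and IntSumReducer classes that ship with Hadoop; the s3a:// scheme belongs to the S3A connector (hadoop-aws module) in recent Hadoop releases, older versions used s3n:// or s3://, and "my-bucket" is a placeholder bucket name.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class S3WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount-on-s3");
        job.setJarByClass(S3WordCount.class);

        // Word-count mapper and reducer shipped with Hadoop.
        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input comes from S3 and the result goes back to S3, so the output
        // survives even after the cluster instances are terminated.
        // "my-bucket" is a placeholder bucket name.
        FileInputFormat.addInputPath(job, new Path("s3a://my-bucket/input/"));
        FileOutputFormat.setOutputPath(job, new Path("s3a://my-bucket/output/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the only thing making this job "S3-aware" is the pair of paths; everything else is an ordinary MapReduce driver.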
BUT the point to be noted here is that this capability comes at the cost of data locality. In HDFS the data is stored local to the machines that process it, while with S3 the data first has to be moved to the individual machines of the cluster before they can process it, and after processing it has to be written back to S3, which is not as fast as the local file system on a machine. Because the data is actually transferred from S3 to the local nodes, this goes against the Hadoop philosophy that moving computation to the data is cheaper than moving data to the computation. A very high-speed network within the same environment does bridge this gap to a large extent: S3 throughput may not be that good from outside the Amazon cloud, but from within the Amazon cloud it should be good. This is also one reason why it is a challenge to run MapReduce on other clouds: within their own environments they do not have a Hadoop-compatible file system the way S3 is there in the Amazon cloud environment.
So this is using S3 as a data store only. The job first loads the data from S3 and brings it into the local file system of the cluster (a huge IO effort on a large amount of data), then runs the map and reduce tasks, and finally writes the results back to S3 (another large amount of IO over a relatively slow network). A rough sketch of this staging pattern is shown below.
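To make the IO pattern described above concrete, here is a minimal sketch, again not taken from the original post, that stages the data from S3 onto the cluster's HDFS before the job and pushes the results back afterwards using Hadoop's FileSystem API. The bucket name and paths are placeholders, and the s3a:// scheme again assumes the S3A connector is available.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class StageViaHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Source: the S3 bucket where the input data has been accumulated.
        // Destination: the cluster's own HDFS. Bucket and paths are placeholders.
        FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        FileSystem hdfs = FileSystem.get(URI.create("hdfs:///"), conf);

        // Pull the dataset onto the cluster before running the job
        // (the first heavy IO step described above).
        FileUtil.copy(s3, new Path("s3a://my-bucket/input/"),
                      hdfs, new Path("/staging/input/"),
                      false /* keep the source */, conf);

        // ... run the MapReduce job against /staging/input/ here ...

        // Push the result back to S3 before terminating the instances
        // (the second heavy IO step).
        FileUtil.copy(hdfs, new Path("/staging/output/"),
                      s3, new Path("s3a://my-bucket/output/"),
                      false, conf);
    }
}

In practice this staging step is often done with the hadoop distcp tool rather than with custom code.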
For the configuration of S3 and Hadoop, please click here.
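The linked post covers the configuration steps in detail; as a bare-minimum sketch (not taken from that post), the S3A connector only needs AWS credentials to be available, either in core-site.xml, through the EC2 instance role, or set directly on the job Configuration as below. The property names belong to the S3A connector in recent Hadoop releases, older S3 connectors used different names, and the key values are placeholders.

import org.apache.hadoop.conf.Configuration;

public class S3ConfExample {
    // Minimal sketch: attach S3A credentials to a Hadoop Configuration.
    // In practice these usually live in core-site.xml or come from the
    // EC2 instance role rather than being hard-coded.
    public static Configuration withS3Credentials(Configuration conf) {
        conf.set("fs.s3a.access.key", "YOUR_AWS_ACCESS_KEY");  // placeholder
        conf.set("fs.s3a.secret.key", "YOUR_AWS_SECRET_KEY");  // placeholder
        return conf;
    }
}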