Thursday, May 5, 2011

S3 as Input or Output for Hadoop MR jobs


This tutorial shows how to use S3 (specifically S3 native) as the input and output location for a Hadoop MapReduce job. We will first look at what S3 is and how S3 differs from S3N, and then configure S3N as input and output for a Hadoop MapReduce job. Configuring S3N as I/O can be useful for jobs run on a local cluster, but it matters most for Elastic MapReduce jobs (i.e. jobs run in the cloud), where you must specify a storage location for input and output that is available both for writing and for retrieval.

What is S3: Amazon S3 (Simple Storage Service) is a data storage service. Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers.
Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of websites. The service aims to maximize benefits of scale and to pass those benefits on to developers. You are billed monthly for storage and data transfer. Transfer between S3 and Amazon EC2 is free, which makes S3 attractive for Hadoop users who run clusters on EC2.


Hadoop provides two filesystems that use S3:

S3 Native FileSystem (URI scheme: s3n)
A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5 GB limit on file size imposed by S3. For this reason it is not suitable as a replacement for HDFS (which has support for very large files).

S3 Block FileSystem (URI scheme: s3)
A block-based filesystem backed by S3. Files are stored as blocks, just as they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket to the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5 GB, but they are not interoperable with other S3 tools.

There are two ways that S3 can be used with Hadoop's MapReduce: either as a replacement for HDFS using the S3 block filesystem (i.e. using it as a reliable distributed filesystem with support for very large files), or as a convenient repository for data input to and output from MapReduce, using either S3 filesystem. In the second case HDFS is still used during the map and reduce phases. Note also that by using S3 as input to MapReduce you lose the data locality optimization, which may be significant.

In this tutorial we will configure S3N as input/output for Hadoop MR jobs, since with S3N we can upload a file of any format as input.

Using S3N as Input/Output for a Hadoop MapReduce Job:
Before configuring Hadoop to use S3 as I/O for MapReduce jobs, first set up Hadoop in either pseudo-distributed or fully distributed mode.

Now add the following entries to hdfs-site.xml:
$ vi HADOOP_INSTALL_DIR/conf/hdfs-site.xml

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>AWS-ID</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>AWS-SECRET-KEY</value>
</property>
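
If you prefer not to keep credentials in hdfs-site.xml, the same two properties can also be set on the job Configuration in driver code. The sketch below is only an illustration (the class and method names are made up); it uses the standard org.apache.hadoop.conf.Configuration API with the same property names shown above.

import org.apache.hadoop.conf.Configuration;

// Illustrative sketch: setting the S3N credentials programmatically
// instead of (or in addition to) hdfs-site.xml.
public class S3nCredentials {
    public static Configuration withS3nCredentials(String awsId, String awsSecretKey) {
        Configuration conf = new Configuration();
        // Same property names as in the hdfs-site.xml entries above
        conf.set("fs.s3n.awsAccessKeyId", awsId);
        conf.set("fs.s3n.awsSecretAccessKey", awsSecretKey);
        return conf;
    }
}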


Now restart Hadoop Daemons:
$ HADOOP_INSTALL_DIR/bin/stop-all.sh
$ HADOOP_INSTALL_DIR/bin/start-all.sh
Now submit a MapReduce job:
$ HADOOP_INSTALL_DIR/bin/hadoop jar hadoop-*-examples.jar wordcount  s3n://BUCKET-NAME/ s3n://BUCKET-NAME/DIRECTORY-NAME
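
For jobs you write yourself, the s3n:// paths are passed to the driver exactly like hdfs:// paths. The following is a minimal, hypothetical driver sketch (mapper and reducer setup omitted; bucket and directory names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver showing s3n:// paths used in place of hdfs:// paths.
public class S3nWordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount over s3n");
        job.setJarByClass(S3nWordCountDriver.class);
        // job.setMapperClass(...) / job.setReducerClass(...) would go here
        job.setOutputKeyClass(Text.class);      // would match the (omitted) reducer
        job.setOutputValueClass(IntWritable.class);
        // Input read from S3 and output written back to S3 via the s3n scheme
        FileInputFormat.addInputPath(job, new Path("s3n://BUCKET-NAME/"));
        FileOutputFormat.setOutputPath(job, new Path("s3n://BUCKET-NAME/DIRECTORY-NAME"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}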

Note that for input you must create the bucket manually and upload your input files to it; for output, the bucket must exist but the directory specified for output must not already exist.
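
Because the job fails if the output directory already exists, it can be handy to check for (and optionally remove) a stale output directory before submitting the job. A small, hypothetical helper using the standard FileSystem API (bucket and directory names are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: remove a stale output directory on S3N before the job runs.
public class S3nOutputCheck {
    public static void ensureOutputAbsent(Configuration conf) throws Exception {
        Path output = new Path("s3n://BUCKET-NAME/DIRECTORY-NAME");
        FileSystem fs = FileSystem.get(URI.create("s3n://BUCKET-NAME/"), conf);
        if (fs.exists(output)) {
            fs.delete(output, true);  // recursive delete of the old output
        }
    }
}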

Alternatively, you can submit the MapReduce job without setting the configuration parameters above, by embedding the credentials directly in the URIs:
$ HADOOP_INSTALL_DIR/bin/hadoop jar hadoop-*-examples.jar wordcount s3n://AWS-ID:AWS-SECRET-KEY@BUCKET-NAME/ s3n://AWS-ID:AWS-SECRET-KEY@BUCKET-NAME/DIRECTORY-NAME

Note that since the secret access key can contain slashes, you must remember to escape them by replacing each slash / with the string %2F.
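
For example, a small, hypothetical helper that escapes the slashes before building the URI:

// Hypothetical helper: escape '/' in the secret key before embedding it in the s3n URI.
public class S3nUri {
    public static String bucketUri(String awsId, String awsSecretKey, String bucket) {
        String escapedKey = awsSecretKey.replace("/", "%2F");
        return "s3n://" + awsId + ":" + escapedKey + "@" + bucket + "/";
    }
}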


AWS-ID (Access Key ID): In the AWS web console, click "Account >> Security Credentials" and look under "Access Credentials >> Access Keys".
AWS-SECRET-KEY (Secret Access Key): Found in the same place, under "Access Credentials >> Access Keys".

Comments:

  1. Hey Rahul, a very good article, thanks!

    Have you tried multipart upload of data using s3n? Please let me know if you have had any experience with it.

    Thanks

  2. Hi Rahul, very informative article. Just had a query:

    If I run a map/reduce program on EC2 and use S3 as input/output, do I not need HDFS at all? And if I use S3 for I/O, will my map/reduce program run slower considering the amount of data transfer?

    Thanks.

    Reply:
      Hi Meenal, I think this post will answer your question:
      http://www.technology-mania.com/2012/05/s3-instead-of-hdfs-with-hadoop_05.html
