Friday, March 4, 2011

Hadoop in Pseudo Distributed Mode

After running Hadoop in standalone mode, let's deploy Hadoop on a single machine:


This section contains instructions for installing Hadoop on Ubuntu. It is a quickstart tutorial: you will find all the commands, with brief descriptions, needed to install Hadoop in pseudo-distributed mode (a single-node cluster). The main goal of this tutorial is to get a "simple" Hadoop installation up and running so that you can play around with the software and learn more about it.


This Tutorial has been tested on:
  • Ubuntu Linux (10.04 LTS)
  • Hadoop 0.20.2
Prerequisites:
Install Java: 
Java 1.6.x (either Sun Java or OpenJDK) is recommended for Hadoop.

1. Add the Canonical Partner Repository to your apt repositories (if you are using an Ubuntu version other than 10.04, add the repository corresponding to that version):

    $ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"   


2. Update the source list

    $ sudo apt-get update   


3. Install sun-java6-jdk

    $ sudo apt-get install sun-java6-jdk   


4. After installation, make a quick check whether Sun’s JDK is correctly set up:

    user@ubuntu:~# java -version
    java version "1.6.0_20"
    Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
    Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)   
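
You will need the JDK's installation directory later to set JAVA_HOME. With the sun-java6-jdk package it is normally /usr/lib/jvm/java-6-sun; if you want to double-check which java binary is actually active, an optional quick check:

    $ readlink -f $(which java)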


Adding a dedicated Hadoop system user:
We will use a dedicated Hadoop user account for running Hadoop. While that's not required, it is recommended because it helps separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc.).


    $ sudo adduser hadoop_admin   
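
A quick, optional way to confirm the account was created:

    $ id hadoop_admin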



Configuring SSH:
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hadoop_admin user.

    user@ubuntu:~$ su - hadoop_admin   
    hadoop_admin@ubuntu:~$ sudo apt-get install openssh-server openssh-client

    hadoop_admin@ubuntu:~$ ssh-keygen -t rsa -P ""
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hadoop_admin/.ssh/id_rsa):
    Created directory '/home/hadoop_admin/.ssh'.
    Your identification has been saved in /home/hadoop_admin/.ssh/id_rsa.
    Your public key has been saved in /home/hadoop_admin/.ssh/id_rsa.pub.
    The key fingerprint is:
    9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hadoop_admin@ubuntu
    The key's randomart image is:
    [...snipp...]
    hadoop_admin@ubuntu:~$




Enable SSH access to your local machine and connect to it using ssh:

    $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
    $ ssh localhost
    The authenticity of host 'localhost (::1)' can't be established.
    RSA key fingerprint is e7:89:26:49:ae:02:30:eb:1d:75:4f:bb:44:f9:36:29.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
    Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 30 13:27:30 UTC 2010 i686 GNU/Linux
    Ubuntu 10.04 LTS
    [...snipp...]
    $
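
The first connection stores localhost's host key; after that, logins should be password-less. You can verify this with a quick non-interactive check before moving on:

    $ ssh localhost "echo SSH is working"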


Hadoop Installation:
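
The commands below assume the Hadoop 0.20.2 tarball is already present in /usr/local. If you still need to download it, it is available from the Apache archive (the exact mirror/URL may differ):

    $ cd /usr/local
    $ sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz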


    $ cd /usr/local
    $ sudo tar xzf hadoop-0.20.2.tar.gz
    $ sudo chown -R hadoop_admin /usr/local/hadoop-0.20.2   


Configuration:


Define JAVA_HOME:

Edit the configuration file /usr/local/hadoop-0.20.2/conf/hadoop-env.sh and set JAVA_HOME to the root of your Java installation (e.g. /usr/lib/jvm/java-6-sun):

    $ vi conf/hadoop-env.sh   
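
In hadoop-env.sh, uncomment the JAVA_HOME line and point it at your JDK; with the Sun JDK installed earlier it would look like this (adjust the path if your JDK lives elsewhere):

    # conf/hadoop-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-6-sun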


Edit the configuration files:


    $ vi conf/core-site.xml

    <configuration>
    <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    </property>

    <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    </property>

    </configuration>

If you give hadoop.tmp.dir some other path, ensure that the hadoop_admin user has read and write permissions on that directory (sudo chown hadoop_admin /your/path).
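
For example, to keep HDFS data out of /tmp (which may be cleared on reboot), you could create a dedicated directory and point hadoop.tmp.dir at it; the path below is only an illustration:

    $ sudo mkdir -p /app/hadoop/tmp
    $ sudo chown hadoop_admin /app/hadoop/tmp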



    $ vi conf/hdfs-site.xml

    <configuration>
    <property>
    <name>dfs.replication</name>
    <value>1</value>
    </property>
    </configuration>



    $ vi conf/mapred-site.xml

    <configuration>
    <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    </property>
    </configuration>




Formatting the name node:

    $ cd /usr/local/hadoop-0.20.2
    $ bin/hadoop namenode -format

It will generate output like the following:


    $ bin/hadoop namenode -format
    10/05/10 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = ubuntu/127.0.1.1
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 0.20.2
    STARTUP_MSG:   build =     https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
    ************************************************************/
    10/05/10 16:59:56 INFO namenode.FSNamesystem: fsOwner=hadoop_admin,hadoop
    10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
    10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
    10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.   
    10/05/08 16:59:57 INFO common.Storage: Storage directory .../.../dfs/name has been successfully formatted.
    10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
    ************************************************************/
    $



Starting single-node cluster:


    $ bin/start-all.sh

It will generate output like the following:


    hadoop_admin@ubuntu:/usr/local/hadoop$ bin/start-all.sh
    starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-ubuntu.out
    localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-ubuntu.out
    localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-ubuntu.out  
    starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-ubuntu.out
    localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-ubuntu.out
    hadoop_admin@ubuntu:/usr/local/hadoop$


Check whether the expected Hadoop processes are running with jps:


    $ jps
    14799 NameNode
    14977 SecondaryNameNode
    15183 DataNode
    15596 JobTracker
    15897 TaskTracker
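
If one of the five daemons is missing, look at its log file under the logs/ directory of the Hadoop installation. The file names follow the pattern hadoop-<user>-<daemon>-<hostname>.log, so the exact name depends on your machine, for example:

    $ tail -n 50 logs/hadoop-hadoop_admin-datanode-ubuntu.log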




Hadoop setup in pseudo-distributed mode is complete!

Stopping single-node cluster:


    $ bin/stop-all.sh

It will generate output like the following:


    $ bin/stop-all.sh
    stopping jobtracker
    localhost: stopping tasktracker
    stopping namenode
    localhost: stopping datanode
    localhost: stopping secondarynamenode
    $




Now let's run some examples (MapReduce jobs):
1. Run the classic Pi example:

    $ bin/hadoop jar hadoop-*-examples.jar pi 10 100   
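
Here the first argument is the number of map tasks and the second the number of samples per map; increasing them gives a more accurate (but slower) estimate of Pi, for example:

    $ bin/hadoop jar hadoop-*-examples.jar pi 4 1000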


2. Run the grep example:

    $ bin/hadoop dfs -mkdir input
    $ bin/hadoop dfs -put conf input
    $ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
    $ bin/hadoop dfs -cat output/*
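
If you want to re-run the job, remove the output directory first, since Hadoop refuses to overwrite an existing one:

    $ bin/hadoop dfs -rmr output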




3. Run the word count example:

    $ bin/hadoop dfs -mkdir inputwords
    $ bin/hadoop dfs -put conf inputwords
    $ bin/hadoop jar hadoop-*-examples.jar wordcount inputwords outputwords
    $ bin/hadoop dfs -cat outputwords/*
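
You can also copy the results back to the local filesystem for closer inspection; the local target path below is just an example:

    $ bin/hadoop dfs -get outputwords /tmp/outputwords
    $ ls /tmp/outputwords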




Web based Interface for NameNode
http://localhost:50070


Web based Interface for JobTracker
http://localhost:50030

Web based Interface for TaskTracker
http://localhost:50060
