Saturday, April 28, 2012

Deploy Hadoop Cluster

Step-by-Step Tutorial to Deploy a Hadoop Cluster (fully distributed mode):
Setting up Hadoop in fully distributed mode requires multiple machines/nodes: one node acts as the master and all the rest act as slaves.
If you want a quick introduction to Hadoop, please click here.
If you want to set up Hadoop in pseudo-distributed mode, please click here.

In this tutorial:
  • I am using 3 nodes: 1 master and 2 slaves
  • I am using the Cloudera distribution for Apache Hadoop, CDH3u3 (you can also use Apache Hadoop 0.20.x)
  • I am deploying Hadoop on Ubuntu (you can use another OS such as CentOS, Red Hat, etc.)

Install / Set up Hadoop on the cluster

Install Hadoop on the master:

1. Add entries for the master and slaves in the hosts file:
Edit the hosts file and add the following entries:
$ sudo pico /etc/hosts
MASTER-IP    master
SLAVE01-IP   slave01
SLAVE02-IP   slave02
(In place of MASTER-IP, SLAVE01-IP and SLAVE02-IP, put the corresponding IP addresses)
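For example, with three machines on a private network (these IP addresses are only placeholders; use your own):
192.168.1.10    master
192.168.1.11    slave01
192.168.1.12    slave02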

2. Install Java
Java 6 is recommended (either Sun JDK or OpenJDK).
Add the repository if required (the commands below are for Ubuntu 11.10; for other versions please add the corresponding repository):
$ sudo apt-get update
$ sudo apt-get install openjdk-6-jdk
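You can verify the installation with:
$ java -version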

3. Setup passwordless SSH (master to slaves)
3.1 Generate an SSH key (on the master)
$ ssh-keygen -t rsa -P ""

3.2 Copy the public key to the slaves (run on the master)
$ ssh-copy-id -i ~/.ssh/id_rsa.pub slave01
$ ssh-copy-id -i ~/.ssh/id_rsa.pub slave02
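To verify, SSH from the master to each slave; you should get a shell without being asked for a password:
$ ssh slave01
$ exit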

4. Untar Hadoop
$ tar xzf hadoop-0.20.2-cdh3u3.tar.gz

Now go to the HADOOP_HOME directory ( $ cd hadoop-0.20.2-cdh3u3 )
(All the commands below assume that you are in the HADOOP_HOME directory)

5. Change configurations
5.1 Set the Java path in conf/hadoop-env.sh (in the conf directory)
$ pico conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/path-to-your-java
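For example, with the OpenJDK 6 package installed in step 2, the path is usually similar to the following (the exact directory name depends on your Ubuntu version and architecture):
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk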

5.2 Add the entry for the master in the masters file
$ pico conf/masters
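With the hostnames from step 1, the masters file contains just:
master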

5.3 Add the entries for the slaves in the slaves file
$ pico conf/slaves
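With the hostnames from step 1, the slaves file contains:
slave01
slave02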

5.4 Add the following entry to core-site.xml
$ pico conf/core-site.xml
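A minimal example for this setup, placed inside the <configuration> element (the port 54310 is a common choice in tutorials, not a requirement):
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>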


5.5 Add the following entry to hdfs-site.xml
$ pico conf/hdfs-site.xml
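A minimal example, inside the <configuration> element (a replication factor of 2 matches the two slaves used here; adjust it for your cluster):
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>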

5.6 Add the following entry to mapred-site.xml
$ pico conf/mapred-site.xml
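A minimal example, inside the <configuration> element (the port 54311 is a common choice, not a requirement):
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>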

Hadoop is now set up on the master. Next, set up Hadoop on the slaves.
To set up Hadoop on the slaves, repeat steps 1 to 5 (except step 3) on all the slaves.

6. Format the NameNode. Format the NameNode only once, when you first install Hadoop; formatting it again will delete all your data from HDFS.
$ bin/hadoop namenode -format

Now start the Hadoop services.
7. Start the Hadoop daemons          (run these commands on the master only)
$ bin/start-dfs.sh
$ bin/start-mapred.sh

8. Check which daemons have started by running the jps command:
8.1 On the master:
$ jps
5480 Jps
4985 NameNode
5293 JobTracker

8.2 On all the slaves:
$ jps
2540 Jps
2474 TaskTracker
2382 DataNode

You should configure the SecondaryNameNode on a node other than the master (it can be one of the slaves). To do this, put the IP / hostname of that node in the masters file.
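For example, to run the SecondaryNameNode on slave01, conf/masters would contain just:
slave01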

The above tutorial/commands/setup have been tested on:
OS: Ubuntu 12.04    (should also work with other Ubuntu versions and other OSes such as CentOS, Red Hat, etc.)
Hadoop: CDH3u6   (should also work with other Hadoop versions such as Apache 0.20.x and other CDH3 releases)
                                  (with little or no modification)


  1. Hi Rahul,

    I am trying to install Cloudera Hadoop.

    1. I untarred the CDH tarball.
    2. I made changes in the configuration files.

    but I am getting this error:

    [root@hadoopdb bin]# ./start-all.sh
    starting namenode, logging to /home/Hadoop//logs/hadoop-root-namenode-hadoopdb.out
    May not run daemons as root. Please specify HADOOP_NAMENODE_USER
    hadoopdb: starting datanode, logging to /home/Hadoop//logs/hadoop-root-datanode-hadoopdb.out
    hadoopdb: May not run daemons as root. Please specify HADOOP_DATANODE_USER
    hadoopdb: starting secondarynamenode, logging to /home/Hadoop//logs/hadoop-root-secondarynamenode-hadoopdb.out
    hadoopdb: May not run daemons as root. Please specify HADOOP_SECONDARYNAMENODE_USER
    starting jobtracker, logging to /home/Hadoop//logs/hadoop-root-jobtracker-hadoopdb.out
    May not run daemons as root. Please specify HADOOP_JOBTRACKER_USER
    hadoopdb: starting tasktracker, logging to /home/Hadoop//logs/hadoop-root-tasktracker-hadoopdb.out
    hadoopdb: May not run daemons as root. Please specify HADOOP_TASKTRACKER_USER
    [root@hadoopdb bin]#

  2. Hi Amit,
    You are getting this problem because you are running Cloudera Hadoop as the root user. To solve it you have two options:

    1. Work as a non-root user.
    2. Add the HADOOP_*_USER variables to conf/hadoop-env.sh, for example:
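    (The user name "hadoop" below is just an example; use the account you want the daemons to run as. These are the variables named in the error messages.)
    export HADOOP_NAMENODE_USER=hadoop
    export HADOOP_DATANODE_USER=hadoop
    export HADOOP_SECONDARYNAMENODE_USER=hadoop
    export HADOOP_JOBTRACKER_USER=hadoop
    export HADOOP_TASKTRACKER_USER=hadoop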

    1. As per your guidance:


      [root@hadoopdb bin]# ./start-all.sh
      starting namenode, logging to /home/Hadoop//logs/hadoop-root-namenode-hadoopdb.out
      hadoopdb: starting datanode, logging to /home/Hadoop//logs/hadoop-root-datanode-hadoopdb.out
      hadoopdb: starting secondarynamenode, logging to /home/Hadoop//logs/hadoop-root-secondarynamenode-hadoopdb.out
      starting jobtracker, logging to /home/Hadoop//logs/hadoop-root-jobtracker-hadoopdb.out
      hadoopdb: starting tasktracker, logging to /home/Hadoop//logs/hadoop-root-tasktracker-hadoopdb.out
      [root@hadoopdb bin]# jps
      12783 Jps
      7740 QuorumPeerMain
      12440 SecondaryNameNode
      [root@hadoopdb bin]# jps
      7740 QuorumPeerMain
      12440 SecondaryNameNode
      12804 Jps
      [root@hadoopdb bin]# jps
      7740 QuorumPeerMain
      12440 SecondaryNameNode
      12862 Jps

    2. It might be that the "USER-NAME" you put in hadoop-env.sh does not exist,
      or there is some other reason.

      Please scan your log files and post the error

  3. Hi Rahul...

    Can you give me an idea of how to determine the replication factor for a Hadoop cluster? Suppose we have a setup of:
    one NameNode
    one SecondaryNameNode
    20 DataNodes.

    What replication factor should be set for this setup, or is there a formula to determine the RF for any Hadoop cluster?

    Thanks in Advance.

    1. By default the replication factor is 3.
      You can specify/change it in hdfs-site.xml

      by adding the following entry:
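      For example, to set it explicitly to 3, inside the <configuration> element:
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>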


  4. Hi..

    Is a replication factor of 3 enough for a 20-DataNode setup?

    1. It depends on your requirements.
      Ideally it is OK.

  5. Hi, I am new to Hadoop and HBase. My manager has given me the task of installing Hadoop and configuring it for use with HBase.

    My current application is in Java + an Oracle RDBMS.

    Now I want to migrate it to Hadoop + HBase.

    Please let me know, what all things I need to do to accomplish this.

    This is urgent. I need to provide details related to the RDBMS->HBase conversion and HBase configuration for Hadoop by tomorrow.

    Please help.

    My personal email id:

    or provide your contact details so I can contact you.


    1. Hi Rashmi,

      Firstly you need to install Hadoop (Steps are given above)
      Install HBase (please refer to the HBase installation post).

      Do some R&D on Sqoop; it is used for migrating data from SQL databases to Hadoop.

      You can post all your queries here; we try to answer as fast as possible.

  6. Hi Rahul, the tutorial was really helpful, but how do I monitor that all the slave nodes are actually working?
    Is it mandatory that I configure Ganglia, or can I track it from the browser?

    1. Hi Shankar,
      Installing Ganglia would be a good idea for monitoring all the resources of the cluster,
      but if you just want to see whether all the slave nodes are actually working, you can check that on the NameNode web UI at YOUR-IP:50070.

    2. The cluster summary reads:

      73 files and directories, 47 blocks = 120 total. Heap Size is 73.5 MB / 888.94 MB (8%)
      Configured Capacity : 94.4 GB
      DFS Used : 860 KB
      Non DFS Used : 6.07 GB
      DFS Remaining : 88.33 GB
      DFS Used% : 0 %
      DFS Remaining% : 93.57 %
      Live Nodes : 1
      Dead Nodes : 0
      Decommissioning Nodes : 0
      Number of Under-Replicated Blocks : 46

      Here it says Live Nodes = 1, whereas I have connected 3 nodes as slaves and one master?

    3. If you have 3 DataNodes it should say
      Live Nodes : 3

      Please check whether the services on the slaves are running by running jps.

      Also check that you have enabled passwordless SSH.

    4. Yes, I have enabled passwordless SSH by copying the public key of the master node to all the slave nodes. Should I also copy the public keys of all the slave nodes to the master?

      Slave 1:jps reads
      7116 TaskTracker
      7032 SecondaryNamenode
      6940 DataNode
      7567 JPS

      Slave 2:jps reads
      3719 JPS
      3067 Secondary Namenode
      3166 TaskTracker
      2981 DataNode
      3582 JobTracker

      Slave 3:jps reads
      6604 JPS
      6048 SecondaryNamenode
      6132 Tasktracker
      6476 JobTracker
      5956 DataNode

      Master : jps reads
      6661 JPS
      6193 DataNode
      6064 Namenode
      6357 Secondary Namenode
      6595 TaskTracker
      6461 JobTracker

      I'm confused why the services are started repeatedly!

    5. There might be a problem with your masters and slaves files, please check them.
      Also, please run start-dfs.sh and start-mapred.sh on the master only.

    6. Master

      8998 NameNode
      9652 Jps
      9409 JobTracker
      9540 TaskTracker
      9302 SecondaryNameNode
      9126 DataNode

      slave 1

      slave 2

      slave 3

    7. Please read about the basic architecture of Hadoop: there is one master and multiple slaves.
      Also, in your conf/masters file put only one entry, the node where you want to run the SecondaryNameNode.

  7. Hi Rahul,
    Is it possible to set up a multi-node (4-node) Hadoop cluster using multiple Linux shells?