Saturday, April 28, 2012

Deploy Hadoop Cluster

Step-by-Step Tutorial to Deploy a Hadoop Cluster (fully distributed mode):
Setting up Hadoop as a cluster (fully distributed mode) requires multiple machines/nodes; one node acts as the master and the rest act as slaves.
If you want a quick introduction to Hadoop, please click here.
If you want to set up Hadoop in pseudo-distributed mode, please click here.

In this tutorial:
  • I am using 3 nodes: 1 master and 2 slaves
  • I am using the Cloudera distribution for Apache Hadoop, CDH3U3 (you can also use Apache Hadoop 0.20.x)
  • I am deploying Hadoop on Ubuntu (you can use another OS, such as CentOS, Red Hat, etc.)

Install / Set up Hadoop on the cluster

Install Hadoop on the master:

1. Add entries for the master and slaves in the hosts file:
Edit the hosts file and add the following entries
$ sudo pico /etc/hosts
MASTER-IP    master
SLAVE01-IP   slave01
SLAVE02-IP   slave02
(In place of MASTER-IP, SLAVE01-IP and SLAVE02-IP, put the corresponding IP addresses)
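For example, with hypothetical private addresses (yours will differ), the entries could look like this:
192.168.1.100    master
192.168.1.101    slave01
192.168.1.102    slave02
Do the same on every node, so that each machine can resolve master, slave01 and slave02.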



Prerequisite:
2. Install Java
Java 6 is recommended (either Sun or OpenJDK).
(The commands below are for Ubuntu; for other Ubuntu versions or other OSes, please add the corresponding repository / use the corresponding package manager.)
$ sudo apt-get update
$ sudo apt-get install openjdk-6-jdk
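To confirm the installation and to find the JDK directory you will need later for JAVA_HOME (the exact path varies by system):
$ java -version
$ readlink -f $(which javac)
(the directory above bin/ in the second output is your JAVA_HOME)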


3. Setup passwordless SSH (master to slaves)
3.1 Generate an SSH key pair (run on the master)
$ ssh-keygen -t rsa -P ""

3.2 Copy the public key to the slaves (run on the master)
$ ssh-copy-id -i ~/.ssh/id_rsa.pub slave01
$ ssh-copy-id -i ~/.ssh/id_rsa.pub slave02
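You can verify that passwordless SSH works; these commands should return the slave hostnames without asking for a password:
$ ssh slave01 hostname
$ ssh slave02 hostname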

4. Untar Hadoop
$ tar xzf hadoop-0.20.2-cdh3u3.tar.gz

Now go to the HADOOP_HOME directory ( $ cd hadoop-0.20.2-cdh3u3 )
(All the commands mentioned below assume that you are in the HADOOP_HOME directory)
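Optionally, you can export HADOOP_HOME so the path is easy to refer to later (adjust the path to wherever you extracted the tarball):
$ export HADOOP_HOME=/path/to/hadoop-0.20.2-cdh3u3
$ cd $HADOOP_HOME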


5. Change the configuration files
5.1 Set the Java path in hadoop-env.sh (in the conf directory)
$ pico conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/path-to-your-java
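For example, with OpenJDK 6 on Ubuntu the path is usually something like the following (the exact directory name varies by release and architecture, so use the path found in step 2):
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk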

5.2 Add an entry in the masters file (this file specifies the node that will run the SecondaryNameNode; see the note after step 8)
$ pico conf/masters
slave01

5.3 Add entries for the slaves in the slaves file
$ pico conf/slaves
slave01
slave02

5.4 Add the following entries in core-site.xml
$ pico conf/core-site.xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/path/to/your/directory/hadoop-${user.name}</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>
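Note that these <property> blocks must go inside the <configuration> element of the file (the same applies to hdfs-site.xml and mapred-site.xml below). A minimal core-site.xml would look roughly like this, with the tmp path being just a placeholder:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/path/to/your/directory/hadoop-${user.name}</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>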

5.5 Add the following entry in hdfs-site.xml
$ pico conf/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
(dfs.replication should not be higher than the number of DataNodes; with the 2 slaves used in this tutorial, a value of 2 avoids under-replicated blocks)

5.6 Add the following entry in mapred-site.xml
$ pico conf/mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>

Hadoop is now set up on the master. Next, set up Hadoop on the slaves.
To set up Hadoop on the slaves, repeat steps 1 to 5 (except step 3) on all the slaves, or copy the already-configured installation from the master as shown below.
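For example, assuming the same user name and home directory exist on the slaves (and that steps 1 and 2 have already been done there), you could copy the configured directory from the master with rsync:
$ rsync -avz ~/hadoop-0.20.2-cdh3u3/ slave01:~/hadoop-0.20.2-cdh3u3/
$ rsync -avz ~/hadoop-0.20.2-cdh3u3/ slave02:~/hadoop-0.20.2-cdh3u3/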


6. Format the NameNode. Format the NameNode only once, when you first install Hadoop; formatting it again will delete all your data from HDFS.
$ bin/hadoop namenode -format


Now start the Hadoop services.
7. Start the Hadoop daemons          (run these commands on the master)
$ bin/start-dfs.sh
$ bin/start-mapred.sh
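Once the daemons are up, you can also open the web interfaces in a browser (default ports for this Hadoop version; the hostname assumes the /etc/hosts entries from step 1):
NameNode   : http://master:50070
JobTracker : http://master:50030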


8. Check which daemons have started by running the jps command:
8.1 On master
$ jps
5480 Jps
4985 NameNode
5293 JobTracker

8.2 On all the slaves:
$ jps
2540 Jps
2474 TaskTracker
2382 DataNode

You should configure the SecondaryNameNode on a node other than the master (it can be one of the slaves). To do this, put the IP / name of that node in the masters file (as in step 5.2).

The above tutorial/commands/setup have been tested on:
OS: Ubuntu 12.04    (will work with other Ubuntu versions and other OSes such as CentOS, Red Hat, etc.)
Hadoop: CDH3U6   (will work with other Hadoop versions such as Apache 0.20.x and other CDH3 releases)
                                  (with little or no modification)
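As a final sanity check, you can run one of the example jobs that ship with Hadoop; the exact jar name depends on your Hadoop version, so adjust it to the examples jar present in your HADOOP_HOME:
$ bin/hadoop jar hadoop-examples-*.jar pi 10 100
If the job finishes and prints an estimate of Pi, HDFS and MapReduce are working across the cluster.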

19 comments:

  1. Hi Rahul,

    I am trying to install Cloudera Hadoop.

    1. Untarred the CDH tarball.
    2. Made changes in hadoop-env.sh

    but I am getting this error:

    [root@hadoopdb bin]# ./start-all.sh
    starting namenode, logging to /home/Hadoop//logs/hadoop-root-namenode-hadoopdb.out
    May not run daemons as root. Please specify HADOOP_NAMENODE_USER
    hadoopdb: starting datanode, logging to /home/Hadoop//logs/hadoop-root-datanode-hadoopdb.out
    hadoopdb: May not run daemons as root. Please specify HADOOP_DATANODE_USER
    hadoopdb: starting secondarynamenode, logging to /home/Hadoop//logs/hadoop-root-secondarynamenode-hadoopdb.out
    hadoopdb: May not run daemons as root. Please specify HADOOP_SECONDARYNAMENODE_USER
    starting jobtracker, logging to /home/Hadoop//logs/hadoop-root-jobtracker-hadoopdb.out
    May not run daemons as root. Please specify HADOOP_JOBTRACKER_USER
    hadoopdb: starting tasktracker, logging to /home/Hadoop//logs/hadoop-root-tasktracker-hadoopdb.out
    hadoopdb: May not run daemons as root. Please specify HADOOP_TASKTRACKER_USER
    [root@hadoopdb bin]#

  2. Hi Amit,
    You are getting this problem because you are running Cloudera Hadoop as the root user. To solve this problem you have two options:

    1. Work as a non-root user
    2. Add the following entries to hadoop-env.sh
    export HADOOP_NAMENODE_USER="USER-NAME"
    export HADOOP_SECONDARYNAMENODE_USER="USER-NAME"
    export HADOOP_JOBTRACKER_USER="USER-NAME"
    export HADOOP_DATANODE_USER="USER-NAME"
    export HADOOP_TASKTRACKER_USER="USER-NAME"

    Replies
    1. As per your guidance:

      Result:

      [root@hadoopdb bin]# ./start-all.sh
      starting namenode, logging to /home/Hadoop//logs/hadoop-root-namenode-hadoopdb.out
      hadoopdb: starting datanode, logging to /home/Hadoop//logs/hadoop-root-datanode-hadoopdb.out
      hadoopdb: starting secondarynamenode, logging to /home/Hadoop//logs/hadoop-root-secondarynamenode-hadoopdb.out
      starting jobtracker, logging to /home/Hadoop//logs/hadoop-root-jobtracker-hadoopdb.out
      hadoopdb: starting tasktracker, logging to /home/Hadoop//logs/hadoop-root-tasktracker-hadoopdb.out
      [root@hadoopdb bin]# jps
      12783 Jps
      7740 QuorumPeerMain
      12440 SecondaryNameNode
      12343
      [root@hadoopdb bin]# jps
      7740 QuorumPeerMain
      12440 SecondaryNameNode
      12804 Jps
      12343
      [root@hadoopdb bin]# jps
      7740 QuorumPeerMain
      12440 SecondaryNameNode
      12343
      12862 Jps

    2. Maybe the "USER-NAME" you put in "hadoop-env.sh" does not exist,
      or there is some other reason.

      Please scan your log files and post the error.

  3. Hi Rahul...

    Can you give me an idea of how I can determine the replication factor for a Hadoop cluster? Suppose we have a setup of:
    one NameNode
    one SecondaryNameNode
    20 DataNodes.

    Then what replication factor should be set for this setup, or is there a formula to determine the RF for any Hadoop cluster?

    Thanks in advance.

    Replies
    1. By default the replication factor is 3.
      You can set/change it in hdfs-site.xml by adding the following entry:

      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>

  4. Hi..

    Is a replication factor of 3 enough for a 20-DataNode setup?

    Replies
    1. It depends on your requirements.
      Ideally it's OK.

  5. Hi, I am new to Hadoop and HBase. My manager has given me a task to install Hadoop and configure it for using HBase.

    My current application uses Java + an Oracle RDBMS.

    Now I want to migrate it to Hadoop + HBase.

    Please let me know what all I need to do to accomplish this.

    This is urgent. I need to provide details related to the RDBMS-to-HBase conversion and HBase configuration for Hadoop by tomorrow.

    Please help.

    My personal email id: maheshwari.rashmi@gmail.com

    or provide your contact details so I can contact you.

    Rashmi
    9910165209

    Replies
    1. Hi Rashmi,

      Firstly you need to install Hadoop (steps are given above).
      Then install HBase (please refer to http://hbase.apache.org/book.html).

      Do some R&D on Sqoop; it is used for migrating data from SQL databases into Hadoop.


      For all your queries you can post here; we try to answer as fast as possible.
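      For example, a Sqoop import from Oracle into an HBase table could look roughly like this (the connection string, table and column-family names are only placeholders; please check the Sqoop documentation for your version):
      $ sqoop import \
          --connect jdbc:oracle:thin:@//oracle-host:1521/ORCL \
          --username SCOTT -P \
          --table EMPLOYEES \
          --hbase-table employees \
          --column-family cf \
          --hbase-create-table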

  6. Hi Rahul, the tutorial was really helpful, but how do I monitor that all the slave nodes are actually working?
    Is it mandatory that I configure Ganglia, or can I track it from the browser?

    Replies
    1. Hi Shankar,
      Installing Ganglia would be a good idea to monitor all the resources of the cluster,
      but if you just want to see whether all the slave nodes are actually working, you can find that on the NameNode web UI at YOUR-IP:50070.
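      You can also check from the command line on the master; the following lists the live and dead DataNodes:
      $ bin/hadoop dfsadmin -report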

    2. The cluster summary reads

      73 files and directories, 47 blocks = 120 total. Heap Size is 73.5 MB / 888.94 MB (8%)
      Configured Capacity : 94.4 GB
      DFS Used : 860 KB
      Non DFS Used : 6.07 GB
      DFS Remaining : 88.33 GB
      DFS Used% : 0 %
      DFS Remaining% : 93.57 %
      Live Nodes : 1
      Dead Nodes : 0
      Decommissioning Nodes : 0
      Number of Under-Replicated Blocks : 46

      Here it says Live Nodes = 1, whereas I have connected 3 nodes as slaves and one master.

    3. If you have 3 DataNodes it should be
      Live Nodes : 3

      Please check whether the services on the slaves are running, by running jps.

      Also check that you have enabled passwordless SSH.

    4. Yes, I have enabled passwordless SSH by copying the id_dsa.pub of the master node into all the slave nodes. Should I also copy the id_dsa.pub of all the slave nodes into the master?

      Slave 1:jps reads
      7116 TaskTracker
      7032 SecondaryNamenode
      6940 DataNode
      7567 JPS

      Slave 2:jps reads
      3719 JPS
      3067 Secondary Namenode
      3166 TaskTracker
      2981 DataNode
      3582 JobTracker

      Slave 3:jps reads
      6604 JPS
      6048 SecondaryNamenode
      6132 Tasktracker
      6476 JobTracker
      5956 DataNode

      Master : jps reads
      6661 JPS
      6193 DataNode
      6064 Namenode
      6357 Secondary Namenode
      6595 TaskTracker
      6461 JobTracker

      I'm confused why the services are started repeatedly!!

    5. There might be a problem with your masters and slaves files; please check them.
      Also, please run the start-dfs.sh and start-mapred.sh commands on the master only.

    6. Master

      8998 NameNode
      9652 Jps
      9409 JobTracker
      9540 TaskTracker
      9302 SecondaryNameNode
      9126 DataNode

      slave 1
      DataNode
      Tasktracker
      SecondaryNamenode

      slave 2
      SecondaryNamenode
      Tasktracker
      DataNode

      slave 3
      DataNode
      SecondaryNamenode
      Tasktracker

    7. Please read about the basic architecture of Hadoop: there is one master and multiple slaves.
      Also, in your conf/masters file put only one entry, the node where you want to run the SecondaryNameNode.

  7. Hi Rahul,
    Is it possible to set up a multi-node (4-node) Hadoop cluster using multiple Linux shells?
