Friday, March 4, 2011

Hadoop in Distributed Mode

This section contains instructions for installing Hadoop on Ubuntu. It is a quickstart tutorial: the shortest path to a working installation, listing all the commands, with descriptions, needed to set up Hadoop in distributed mode (a multi-node cluster).

Prerequisite: Before starting Hadoop in distributed mode you must set up Hadoop in pseudo-distributed mode, and you need at least two machines, one for the master and another for the slave (you can create more than one virtual machine on a single physical machine).

The following steps were tested on:
OS: Ubuntu
Hadoop: Apache Hadoop 0.20.X



Deploy Hadoop in Distributed Mode:
  
$ bin/stop-all.sh

Before starting Hadoop in distributed mode, first stop any running Hadoop daemons on every machine in the cluster.

run this cmd on all machines in cluster (master and slave) 
$ vi /etc/hosts
then type:
IP-address-of-master master (e.g. 192.168.0.1 master)
IP-address-of-slave slave (e.g. 192.168.0.2 slave)
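
For reference, a minimal sketch of the resulting /etc/hosts (addresses and hostnames are examples only; use your machines' real LAN addresses):

# /etc/hosts (sketch)
127.0.0.1    localhost
192.168.0.1  master
192.168.0.2  slave

On Ubuntu, if the machine's own hostname is mapped to 127.0.1.1 in this file, the Hadoop daemons may bind to the loopback address, so keep the hostnames pointed at the real LAN addresses as above.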

run this cmd on master
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub slave

Set up passwordless SSH from the master to the slave.
(you must log in with the same user name on all the machines)

run this cmd on master (alternative method)

$ cat $HOME/.ssh/id_rsa.pub

Then copy its content into the .ssh/authorized_keys file on the slave (the machine you wish to SSH to without being prompted for a password). In other words, passwordless SSH can also be set up manually, as sketched below.
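
A minimal sketch of the manual setup, run on the master, assuming no key pair exists yet and that the user name is the same on both machines (the hostname slave comes from /etc/hosts above):

# generate an RSA key pair with an empty passphrase (skip if id_rsa already exists)
$ ssh-keygen -t rsa -P "" -f $HOME/.ssh/id_rsa

# append the public key to the slave's authorized_keys (asks for the password one last time)
$ cat $HOME/.ssh/id_rsa.pub | ssh slave 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'

# also authorize the key locally so the master can SSH to itself
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

# verify: both logins should now work without a password prompt
$ ssh slave
$ ssh master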

run this cmd on master

$ vi conf/masters
then type master

Despite its name, the conf/masters file defines the hosts on which the SecondaryNameNode daemon will run; the NameNode itself runs on the machine where the start scripts are executed (the master).

run this cmd on master 

$ vi conf/slaves
then type slave
The conf/slaves file lists the hosts, one per line, on which the Hadoop slave daemons (DataNodes and TaskTrackers) will run.
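
If the cluster has more than one slave, each hostname goes on its own line; a sketch assuming two hypothetical slaves, slave1 and slave2, that are also declared in /etc/hosts:

# conf/slaves (sketch)
slave1
slave2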

run this cmd on all machines in cluster (master and slave)   

$ vi conf/core-site.xml
then type:
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
 </property>

Edit configuration file core-site.xml
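
Note that in all of these configuration files the <property> elements must be placed inside the top-level <configuration> element. A minimal sketch of the complete core-site.xml (the same wrapper applies to mapred-site.xml and hdfs-site.xml below):

<?xml version="1.0"?>
<!-- core-site.xml (sketch): every node points at the NameNode running on master -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
</configuration>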



run this cmd on all machines in cluster (master and slave)  

$ vi conf/mapred-site.xml
then type:
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
 </property>

Edit configuration file mapred-site.xml



run this cmd on all machines in cluster (master and slave)  
$ vi conf/hdfs-site.xml
then type:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

Edit configuration file hdfs-site.xml. dfs.replication sets the default number of copies HDFS keeps of each block.


run this cmd on all machines in cluster (master and slave)  

$ vi conf/mapred-site.xml
then type:
<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value> 
</property>



Add these additional properties to the configuration file mapred-site.xml.

run this cmd on master

$ bin/start-dfs.sh

Start the multi-node cluster. First the HDFS daemons are started: the NameNode daemon on the master, and DataNode daemons on all slaves.
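
If a daemon does not come up, check its log file under the logs/ directory on the machine where it should be running; a sketch, assuming the default log location and the usual hadoop-<user>-<daemon>-<hostname>.log naming:

$ tail -f logs/hadoop-*-namenode-*.log    # on the master
$ tail -f logs/hadoop-*-datanode-*.log    # on a slave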

run this cmd on master   

$ jps


It should give output like this:
14799 NameNode
15314 Jps
16977 SecondaryNameNode


run this cmd on all slaves

$ jps



It should give output like this:
15183 DataNode
15616 Jps

run this cmd on master

$ bin/start-mapred.sh

The MapReduce daemons are started: the JobTracker on the master, and TaskTracker daemons on all slaves.

run this cmd on master 

$ jps


It should give output like this:
16017 Jps
14799 NameNode
15596 JobTracker
14977 SecondaryNameNode

run this cmd on all slaves

$ jps


It should give output like this:
15183 DataNode
15897 TaskTracker
16284 Jps



Congratulations, the Hadoop setup is complete.

http://localhost:50070/ web-based interface for the NameNode (open it on the master)
http://localhost:50030/ web-based interface for the JobTracker (open it on the master)
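
The cluster state can also be checked from the command line; run this on the master to list the live DataNodes and the cluster capacity:

$ bin/hadoop dfsadmin -report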
run these cmds on master

Now let's run some examples.

run the pi example:
$ bin/hadoop jar hadoop-*-examples.jar pi 10 100

run the grep example:
$ bin/hadoop dfs -mkdir input
$ bin/hadoop dfs -put conf input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ bin/hadoop dfs -cat output/*

run the wordcount example:
$ bin/hadoop dfs -mkdir inputwords
$ bin/hadoop dfs -put conf inputwords
$ bin/hadoop jar hadoop-*-examples.jar wordcount inputwords outputwords
$ bin/hadoop dfs -cat outputwords/*

run these cmds on master

$ bin/stop-mapred.sh
$ bin/stop-dfs.sh

To stop the MapReduce and HDFS daemons.
