
Optimize Map Reduce Job Performance

To improve Hadoop performance, you need to tune various configuration parameters in core-site.xml, hdfs-site.xml, and mapred-site.xml. The right values depend on the type of processing and vary from case to case; there is no hard and fast rule.

To install Hadoop on an Ubuntu cluster, you can refer to this post.

We can change the block size, the number of mappers and reducers, the sort factor, JVM reuse, and the memory for the Java processes. We can also enable compression (including map output compression), use a combiner, etc.
I found a very nice description given by Cloudera.
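
As a rough illustration of the parameters listed above, the sketch below writes a sample mapred-site.xml into a scratch directory. The property names are the ones used in the Hadoop 0.20.x line (check them against the mapred-default.xml of your version). CONF_DIR and every value shown are assumptions chosen for illustration, not recommended settings.

```shell
#!/bin/sh
# Sketch only: writes an example mapred-site.xml to a scratch directory.
# CONF_DIR is a hypothetical location; the values are illustrative, not tuned.
CONF_DIR="${CONF_DIR:-./conf-sample}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Sort factor: number of streams merged at once while sorting -->
  <property>
    <name>io.sort.factor</name>
    <value>50</value>
  </property>
  <!-- Compress intermediate map output -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <!-- JVM reuse: -1 reuses a JVM for any number of tasks of the same job -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>
  <!-- Memory (heap) for each task's child Java process -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
</configuration>
EOF
```

After reviewing the generated file, you would copy the relevant properties into the mapred-site.xml of your actual Hadoop conf directory and measure the effect on your own jobs.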

Deploy Hadoop Cluster

Step by Step Tutorial to Deploy Hadoop Cluster (fully distributed mode):
Setting up Hadoop in a distributed cluster requires multiple machines/nodes: one node acts as the master and all the others act as slaves.
If you want a quick introduction to Hadoop, please click here.
If you want to set up Hadoop in pseudo-distributed mode, please click here.

In this tutorial:
  • I am using 3 nodes: 1 master and 2 slaves
  • I am using Cloudera's distribution for Apache Hadoop, CDH3U3 (you can also use Apache Hadoop 0.20.X)
  • I am deploying Hadoop on Ubuntu (you can use another OS, such as CentOS or Red Hat)

Install / Setup Hadoop on cluster

Install Hadoop on master:

1. Add entries for the master and slaves to the hosts file:
Edit the hosts file and add the following entries:
$ sudo pico /etc/hosts
MASTER-IP    master
SLAVE01-IP   slave01
SLAVE02-IP   slave02
(Replace MASTER-IP, SLAVE01-IP, and SLAVE02-IP with the corresponding IP addresses.)
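
If you prefer to script this step instead of editing the file by hand, a minimal sketch might look like the following. It appends each entry only if the hostname is not already present. The add_host helper, the HOSTS_FILE variable, and the 192.0.2.x addresses are assumptions for illustration (the addresses come from a range reserved for documentation), so substitute your real IPs.

```shell
#!/bin/sh
# Sketch only: add master/slave entries to a hosts file without duplicating them.
# HOSTS_FILE defaults to a scratch file so the script can be tried safely;
# on a real node, run it with HOSTS_FILE=/etc/hosts and root privileges.
HOSTS_FILE="${HOSTS_FILE:-./hosts.sample}"

# add_host IP NAME (hypothetical helper): append "IP<TAB>NAME" unless a line
# ending in NAME already exists in the file.
add_host() {
    ip="$1"
    name="$2"
    if ! grep -qE "[[:space:]]${name}\$" "$HOSTS_FILE" 2>/dev/null; then
        printf '%s\t%s\n' "$ip" "$name" >> "$HOSTS_FILE"
    fi
}

# Placeholder addresses: replace with MASTER-IP, SLAVE01-IP, SLAVE02-IP.
add_host 192.0.2.10 master
add_host 192.0.2.11 slave01
add_host 192.0.2.12 slave02
```

Running the script a second time leaves the file unchanged, which is convenient when the same script is reused on every node.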