Wednesday, May 14, 2014

Installing YARN (Hadoop 2.4.0) on Ubuntu

Steps to install Hadoop 2.x release (Yarn) on single node cluster setup (pseudo-distributed mode)
The Hadoop 2.x release involves many changes to Hadoop and MapReduce. The centralized JobTracker service is replaced by a ResourceManager, which manages the resources in the cluster, and a per-application ApplicationMaster, which manages the application lifecycle. These architectural changes enable Hadoop to scale to much larger clusters. For more details on the architectural changes in next-generation Hadoop (a.k.a. YARN), watch this video or visit this blog.
This post explains how to install Hadoop 2.x (YARN) on a single-node cluster.
Prerequisites:
  1. Java 7 installed
  2. Dedicated user for hadoop (not mandatory)
  3. SSH configured
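The prerequisites above can be checked quickly from a terminal. The following is a minimal sketch, not a complete validation: the helper name `have` is made up for this example, and `hduser` is assumed to be the dedicated account (the name used later in this post). It only reports what is present and does not install anything.

```shell
#!/bin/sh
# have CMD: print "ok" if CMD is on the PATH, otherwise "missing".
have() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok"
    else
        echo "missing"
    fi
}

# Presence only; run 'java -version' yourself to confirm it is Java 7.
echo "java: $(have java)"
echo "ssh:  $(have ssh)"
echo "sshd: $(have sshd)"

# Dedicated user (optional). Change the name if your account differs.
if id hduser >/dev/null 2>&1; then
    echo "user hduser: ok"
else
    echo "user hduser: missing"
fi
```

If sshd is reported as missing, the SSH server is probably not installed; on Ubuntu it is usually provided by the openssh-server package.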
Steps to install Hadoop 2.x:
1. Download tarball
Download the Hadoop 2.x tarball (hadoop-2.4.0.tar.gz).
Extract it to a folder in your home directory, say $HOME/yarn.
$ cd $HOME/yarn (Optional)

$ sudo chown -R hduser:hadoop hadoop-2.4.0
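The extraction step can be wrapped in a small helper. This is only a sketch: the function name `extract_tarball` is hypothetical, and the example path assumes the tarball was saved to $HOME/Downloads, which may not match your setup.

```shell
#!/bin/sh
# extract_tarball TARBALL DEST: unpack a gzipped TARBALL into DEST,
# creating DEST first if it does not exist.
extract_tarball() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Example usage (assumed download location; adjust the path):
#   extract_tarball "$HOME/Downloads/hadoop-2.4.0.tar.gz" "$HOME/yarn"
#   cd "$HOME/yarn"
#   sudo chown -R hduser:hadoop hadoop-2.4.0
```

Extracting into a directory you own avoids running tar under sudo; the chown is only needed if the files should belong to the dedicated hduser account.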


Tuesday, May 6, 2014

An Insight into Hadoop YARN NodeManager

Conceptually, the NodeManager is a more generic, flexible, and efficient version of the TaskTracker from the Hadoop 1 architecture. In contrast to the fixed number of map and reduce slots in MRv1, the NodeManager in MRv2 has a number of dynamically created resource containers; there is no hardcoded split into map and reduce slots. A container is a collection of resources such as memory, CPU, disk, and network I/O. The number of containers on a node is determined by configuration parameters together with the total amount of node resources. The NodeManager is the per-machine (per-node) framework agent that is responsible for containers, monitoring their resource usage, and reporting it to the ResourceManager. Besides overseeing container life-cycle management, the NodeManager tracks the health of the node on which it runs and controls auxiliary services that different YARN applications may use at any time. The NodeManager can execute any computation that makes sense to the ApplicationMaster simply by creating a container for each task.
Source : Hortonworks

The architecture diagram above gives a detailed view of the NodeManager components.
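To make the relationship between node resources and container count concrete, here is a rough back-of-the-envelope sketch that considers memory only. The function name and the numbers are hypothetical. In a real cluster the memory a node offers to YARN is set by yarn.nodemanager.resource.memory-mb and the smallest container size by yarn.scheduler.minimum-allocation-mb, and the actual count also depends on what each application requests.

```shell
#!/bin/sh
# max_containers NODE_MB CONTAINER_MB: how many containers of CONTAINER_MB
# fit into NODE_MB of memory (integer division, memory only).
max_containers() {
    echo $(( $1 / $2 ))
}

# Hypothetical node: 8192 MB offered to YARN, 1024 MB per container.
max_containers 8192 1024   # prints 8
```

With larger containers the same node holds fewer of them, which is the flexibility that fixed map and reduce slots did not offer.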


Sunday, May 4, 2014

An Insight into Hadoop Yarn Resource Manager

The ResourceManager is the core component of YARN, the Hadoop 2.0 framework. By analogy, it takes the place of the JobTracker in MRv1. YARN is designed to provide a generic and flexible framework for administering the computing resources in the cluster.
To that end, the YARN ResourceManager service (RM) is the central controlling authority for resource management and makes the allocation decisions.
ResourceManager has two main components: Scheduler and ApplicationsManager.
The Scheduler API is specifically designed to negotiate resources, not to schedule tasks. The Scheduler does not monitor or track the status of applications. It performs its scheduling function based on the resource requirements of the applications, using the abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk, and network. The ResourceManager makes no guarantee about restarting tasks that fail because of application or hardware failures. Applications can request resources at different layers of the cluster topology, such as nodes and racks. The Scheduler determines how much to allocate, and where, based on resource availability and the configured sharing policy.
The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications, and so on. The current MapReduce schedulers, such as the CapacityScheduler and the FairScheduler, are examples of such plug-ins.
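As a rough illustration of what partitioning cluster resources among queues means, the sketch below splits a cluster's memory by fixed percentages, in the spirit of a capacity-style policy. The function name, queue names, and numbers are all hypothetical, and real schedulers such as the CapacityScheduler apply further rules that are not modeled here.

```shell
#!/bin/sh
# queue_share TOTAL_MB PERCENT: memory in MB that a queue is entitled to
# when it is configured with PERCENT of the cluster (integer arithmetic).
queue_share() {
    echo $(( $1 * $2 / 100 ))
}

# Hypothetical cluster with 16384 MB and two queues at 70% and 30%.
echo "prod: $(queue_share 16384 70) MB"
echo "dev:  $(queue_share 16384 30) MB"
```

Integer division rounds down, so the shares may add up to slightly less than the cluster total.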
The ApplicationsManager is responsible for maintaining a collection of submitted applications. It accepts jobs from the client, negotiates the first container for executing the application-specific ApplicationMaster, and provides the service for restarting the ApplicationMaster if it fails. It also keeps a cache of completed applications so that it can serve users' requests via the web UI or the command line long after the applications in question have finished.

Saturday, May 3, 2014

A soft Introduction to YARN

YARN : The Next Generation Hadoop Framework
The MapReduce framework for processing big data has undergone a complete change in architecture and functionality from hadoop-0.23 onwards. The new framework is popularly known as MapReduce 2.0 (MRv2) or YARN (Yet Another Resource Negotiator).
Resource management and application management have changed significantly, and the new design addresses MRv1 problems such as underutilization of cluster resources and the NameNode as a single point of failure. Hadoop 2 (YARN) allows workloads to share cluster resources dynamically among a variety of processing frameworks, such as MapReduce.