Sunday, November 23, 2014

Setup Hadoop 1.x on Single Node Cluster | Hadoop Training | Hadoop Works...

Online Hadoop Training From DataFlair



http://data-flair.com/course/big-data-and-hadoop/

Email us: info@data-flair.com

Phone: 8451097879



This video covers the installation and configuration of Hadoop 1.x or Cloudera CDH3 in pseudo-distributed mode: the prerequisites for Hadoop, installing Java, setting up password-less SSH, the mandatory configuration in core-site.xml, hdfs-site.xml, and mapred-site.xml, formatting the NameNode, starting the Hadoop services (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker), setting up environment variables for Hadoop, submitting a MapReduce job, and using the web console.
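As a rough illustration of the three configuration files mentioned above, here is a minimal sketch that writes single-property versions of them into a directory. The property names are the Hadoop 1.x ones; the host name localhost, the ports 9000 and 9001, the replication factor of 1, and the helper names write_property_file and write_hadoop1_conf are assumptions made for this example and are not taken from the video.

```shell
#!/bin/sh
# Sketch only: minimal pseudo-distributed configuration for Hadoop 1.x.
# Assumed values (not from the video): localhost, ports 9000/9001, replication 1.

# write_property_file FILE NAME VALUE
# Writes a Hadoop configuration file that contains a single property.
write_property_file() {
  cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# write_hadoop1_conf DIR
# Creates DIR if needed and writes the three site files into it.
write_hadoop1_conf() {
  conf_dir=$1
  mkdir -p "$conf_dir" || return 1
  write_property_file "$conf_dir/core-site.xml" fs.default.name hdfs://localhost:9000 &&
  write_property_file "$conf_dir/hdfs-site.xml" dfs.replication 1 &&
  write_property_file "$conf_dir/mapred-site.xml" mapred.job.tracker localhost:9001
}

# Demo: write the files into a scratch directory and show one of them.
demo_dir=$(mktemp -d)
write_hadoop1_conf "$demo_dir"
cat "$demo_dir/core-site.xml"
```

In a real installation the files would go into the conf directory of the Hadoop installation rather than a scratch directory, and they usually carry more properties than the single one shown here.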



About Big Data - Hadoop Training Course:

An online course designed by Hadoop experts to provide in-depth knowledge and practical skills in the field of Big Data and Hadoop, to make you a successful Hadoop developer.



The Big Data and Hadoop training course is designed to provide the knowledge and skills needed to become a successful Hadoop developer. The course covers, in depth, concepts such as the Hadoop Distributed File System, single-node and multi-node Hadoop clusters, Hadoop 2.0, Flume, Sqoop, MapReduce, Pig, Hive, HBase, ZooKeeper, and Oozie.





Course Objectives:



After the completion of the ‘Big Data and Hadoop’ Course you should be able to:



1. Master the concepts of Hadoop Distributed File System and MapReduce framework

2. Set up Hadoop on single-node and multi-node clusters

3. Understand Data Loading Techniques using Sqoop and Flume

4. Program in MapReduce (both MRv1 and MRv2)

5. Learn to write Complex MapReduce programs

6. Program in YARN (MRv2)

7. Perform Data Analytics using Pig and Hive

8. Implement HBase, MapReduce Integration, Advanced Usage and Advanced Indexing

9. Have a good understanding of ZooKeeper service

10. Understand the new features in Hadoop 2.0: YARN, HDFS Federation, NameNode High Availability

11. Implement best practices for Hadoop development and debugging

12. Implement a Hadoop Project

13. Work on a real-life Big Data analytics project and gain hands-on project experience



Please contact us for more details: info@data-flair.com, 8451097879.



Wednesday, May 14, 2014

Installing YARN (Hadoop 2.4.0) on Ubuntu

Steps to install the Hadoop 2.x release (YARN) on a single-node cluster (pseudo-distributed mode)
The Hadoop 2.x release involves many changes to Hadoop and MapReduce. The centralized JobTracker service is replaced with a ResourceManager that manages the resources in the cluster and a per-application ApplicationMaster that manages the application lifecycle. These architectural changes enable Hadoop to scale to much larger clusters. For more details on the architectural changes in next-generation Hadoop (a.k.a. YARN), watch this video or visit this blog.
This post explains how to install Hadoop 2.x (YARN) on a single-node cluster.
Prerequisites:
  1. Java 7 installed
  2. Dedicated user for hadoop (not mandatory)
  3. SSH configured
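Before starting, it can help to confirm that the prerequisite tools are at least visible on the PATH. The following is a minimal sketch; the helper names have_cmd and check_prereqs are made up for this example. It only checks that the java and ssh commands exist. It does not verify the Java version or that password-less SSH is actually configured.

```shell
#!/bin/sh
# have_cmd NAME
# Succeeds if NAME can be found as a command by the shell.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# check_prereqs
# Prints one line per prerequisite tool and returns the number of missing tools.
check_prereqs() {
  missing=0
  for tool in java ssh; do
    if have_cmd "$tool"; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

check_prereqs || echo "install the missing tools before continuing"
```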
Steps to install Hadoop 2.x:
1. Download tarball
You can download the tarball for Hadoop 2.x from http://apache.cs.utah.edu/hadoop/common/current2/ (hadoop-2.4.0.tar.gz).
Extract it to a folder in your home directory, say $HOME/yarn.
$ cd $HOME/yarn    # (Optional)
$ sudo chown -R hduser:hadoop hadoop-2.4.0
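The extraction step described above is not shown as a command, so here is one possible sketch of it. The helper name extract_release is hypothetical, and where the downloaded tarball lives on your machine is an assumption you need to fill in yourself.

```shell
#!/bin/sh
# extract_release TARBALL DEST_DIR
# Unpacks a gzip-compressed tarball into DEST_DIR, creating DEST_DIR if needed.
extract_release() {
  tarball=$1
  dest_dir=$2
  if [ ! -f "$tarball" ]; then
    echo "tarball not found: $tarball" >&2
    return 1
  fi
  mkdir -p "$dest_dir" && tar -xzf "$tarball" -C "$dest_dir"
}
```

For example, if the download was saved in the current directory, extract_release hadoop-2.4.0.tar.gz "$HOME/yarn" would leave the release in $HOME/yarn/hadoop-2.4.0, which is the directory the chown command above operates on.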


Tuesday, May 6, 2014

An Insight into Hadoop YARN NodeManager

Conceptually, the NodeManager is a more generic and efficient version of the TaskTracker from the Hadoop 1 architecture. In contrast to the fixed number of slots for map and reduce tasks in MRv1, the NodeManager in MRv2 has a number of dynamically created resource containers; there is no hardcoded split into map and reduce slots. A container is a collection of resources such as memory, CPU, disk, and network I/O. The number of containers on a node depends on configuration parameters and the total amount of node resources. The NodeManager is the per-machine (per-node) framework agent responsible for containers, monitoring their resource usage, and reporting it to the ResourceManager. Besides overseeing container life-cycle management, the NodeManager tracks the health of the node it runs on and controls auxiliary services that different YARN applications may use at any time. The NodeManager can execute any computation that makes sense to the ApplicationMaster simply by creating a container for each task.
Source: Hortonworks

The architecture diagram above gives a detailed view of the NodeManager components.
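To make the relationship between node resources and container count described above more concrete, here is a deliberately simplified sketch that considers memory only. It assumes the memory a NodeManager may hand out is the value configured in yarn.nodemanager.resource.memory-mb and that every container has the same size. Real clusters mix container sizes and may account for CPU as well, so treat the result as an upper bound. The helper name max_containers is made up for this example.

```shell
#!/bin/sh
# max_containers NODE_MB CONTAINER_MB
# Prints how many equally sized containers fit into the memory
# a node offers to YARN (integer division, memory only).
max_containers() {
  node_mb=$1
  container_mb=$2
  if [ "$container_mb" -le 0 ]; then
    echo "container size must be positive" >&2
    return 1
  fi
  echo $(( node_mb / container_mb ))
}

max_containers 8192 1024    # prints 8
max_containers 8192 1536    # prints 5
```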



Sunday, May 4, 2014

An Insight into Hadoop Yarn Resource Manager

The ResourceManager is the core component of the Hadoop 2.0 framework (YARN). By analogy, it takes the place of the JobTracker in MRv1. YARN is designed to provide a generic and flexible framework for administering the computing resources in the cluster.
To that end, the YARN ResourceManager service (RM) is the central controlling authority for resource management and makes the allocation decisions.
The ResourceManager has two main components: the Scheduler and the ApplicationsManager.
The Scheduler is designed specifically to negotiate resources, not to schedule tasks, and it does not monitor or track the status of applications. It performs its scheduling function based on the resource requirements of the applications, using the abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk, and network. The ResourceManager offers no guarantee of restarting tasks that fail because of application or hardware failures. Applications can request resources at different layers of the cluster topology, such as nodes and racks. The Scheduler determines how much to allocate and where, based on resource availability and the configured sharing policy.
The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications, and so on. The current MapReduce schedulers, such as the CapacityScheduler and the FairScheduler, are examples of such plug-ins.
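As a toy illustration of partitioning cluster resources among queues, the sketch below splits a cluster's memory by fixed percentages, loosely in the spirit of the per-queue capacities used by the CapacityScheduler. It is not the scheduler's actual algorithm. The helper name partition_cluster and the queue names prod and dev are made up for this example.

```shell
#!/bin/sh
# partition_cluster CLUSTER_MB QUEUE:PERCENT...
# Prints each queue name with its share of CLUSTER_MB (rounded down).
partition_cluster() {
  cluster_mb=$1
  shift
  for spec in "$@"; do
    queue=${spec%%:*}      # text before the first colon
    percent=${spec##*:}    # text after the last colon
    echo "$queue $(( cluster_mb * percent / 100 ))"
  done
}

partition_cluster 10240 prod:70 dev:30
# prints:
# prod 7168
# dev 3072
```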
The ApplicationsManager is responsible for maintaining a collection of submitted applications. It accepts job submissions from clients, negotiates a container for executing the application-specific ApplicationMaster, and provides the service for restarting the ApplicationMaster if it fails. It also keeps a cache of completed applications so that it can serve users' requests via the web UI or the command line long after the applications in question have finished.
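The restart behaviour described above can be pictured as a bounded retry loop. The sketch below is only an analogy in shell, not how the ResourceManager is implemented. The helper name run_with_restarts is made up, and the attempt limit plays the role that a setting such as yarn.resourcemanager.am.max-attempts plays in YARN.

```shell
#!/bin/sh
# run_with_restarts MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS runs have failed.
run_with_restarts() {
  max_attempts=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "succeeded on attempt $attempt"
      return 0
    fi
    echo "attempt $attempt failed" >&2
    attempt=$((attempt + 1))
  done
  return 1
}

run_with_restarts 2 true    # prints "succeeded on attempt 1"
```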