
Getting MapReduce 2 Up to Speed

Apache Hadoop is no exception to the rule that big architectural changes can introduce performance regressions. Recently, Cloudera engineers set out to ensure that MapReduce performance in Hadoop 2 (MR2/YARN) is on par with, or better than, MapReduce performance in Hadoop 1 (MR1). Architecturally, MR2 has many performance advantages over MR1:

  • Better scalability by splitting the JobTracker into the ResourceManager and Application Masters.
  • Better cluster utilization and higher throughput through finer-grained resource scheduling.
  • Less tuning required to avoid over-spilling, from smarter sort buffer management.
  • Faster completion times for small jobs through “Uber Application Masters,” which run all of a job’s tasks in a single JVM.
While these improvements are important, none of them matters much for well-tuned, medium-sized jobs on medium-sized clusters. And whenever a codebase goes through large changes, regressions are likely to seep in.

While correctness issues are easy to spot, performance regressions are difficult to catch without rigorous measurement. When we started including MR2 in our performance measurements last year, we noticed that it lagged behind MR1 significantly on nearly every benchmark. Since then, we’ve done a ton of work — tuning parameters in Cloudera Manager and fixing regressions in MapReduce itself — and can now proudly say that CDH 5 MR2 performs as well as, or better than, MR1 on all our benchmarks.

In this post, I’ll offer a couple of examples of this work as case studies in tracking down performance regressions in complex (Java) distributed systems.

Ensuring a Fair Comparison

Ensuring a fair comparison between MR1 and MR2 is tricky. One common pitfall is that TeraSort, the job most commonly used for benchmarking, changed between MR1 and MR2. To reflect rule changes in the GraySort benchmark on which it is based, the data generated by the TeraSort included with MR2 is less compressible. A valid comparison would use the same version of TeraSort for both releases; otherwise, MR1 will have an unfair advantage.

Another difficult area is resource configuration. In MR1, each node’s resources must be split between slots available for map tasks and slots available for reduce tasks. In MR2, the resource capacity configured for each node is available to both map and reduce tasks. So, if you give MR1 nodes 8 map slots and 8 reduce slots and give MR2 nodes 16 slots worth of memory, resources will be underutilized during MR1’s map phase. MR2 will be able to run 16 concurrent mappers per node while MR1 will only be able to run 8. If you only give MR2 nodes 8 slots of memory, then MR2 will suffer during the period when the map and reduce phases overlap – it will only get to run 8 tasks concurrently, while MR1 will be able to run more. (See this post for more information about properly configuring MR2.)

To circumvent this issue, our benchmarks give full node capacity in MR1 to both map slots and reduce slots. We then set the mapred.reduce.slowstart.completedmaps parameter in both to .99, meaning that there will be no overlap between the map and reduce phases. This ensures that MR1 and MR2 get full cluster resources for both phases.
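
As a hedged illustration (assuming a CDH-style mapred-site.xml; the exact property name differs between MR1, mapred.reduce.slowstart.completed.maps, and MR2, mapreduce.job.reduce.slowstart.completedmaps), the setting would look roughly like:

<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.99</value>    <!-- don't start reducers until ~99% of maps have finished -->
</property>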


Hadoop Troubleshooting-001

Hadoop throws the following error (java.lang.UnsupportedClassVersionError) when running a Hadoop / HDFS command:

bash-3.00$ hadoop  dfs -ls /
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)

Solution:
This is a typical Java error, caused when you run a class compiled with a newer Java version than the one currently installed on your system. If you are running Hadoop against an older Java version (older than 1.6), you will get this error. Check the Java version (java -version) and point Hadoop to the correct Java installation.
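
A minimal sketch of the fix (paths here are illustrative, not prescriptive):

$ java -version                               # confirm the JVM is 1.6 or newer
$ vi /etc/hadoop/conf/hadoop-env.sh           # CDH default location; Apache tarballs use conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun      # inside hadoop-env.sh, point to your 1.6+ JDK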

Big Data Use Cases Across Industries: It's Becoming Bigger and Bigger

Big Data will keep getting bigger and bigger; according to certain market forecasts it will increase by 1211%.
Data volume is growing at a tremendous pace: according to Gartner, we create 2.5 quintillion bytes of data per day, and 90% of the data in the world has been created in the last two years alone.
An interesting infographic from Wikibon.org


Image source: wikibon.org/

HBase at Facebook

After understanding the basics of HBase, let’s try to understand how Facebook uses HBase. I found a very good tutorial from Facebook on how they use HBase for messaging. It covers an introduction to HBase, why HBase, and the MySQL-to-HBase migration at Facebook.


 

HBase - A Soft Introduction & Quickstart

After understanding what Hadoop is, and after deploying Hadoop, let’s start understanding HBase. This tutorial explains the basics of HBase and its features. Here I try to explain the functionality HBase provides, along with a quick start: a basic tutorial for beginners. You will get to know where to use HBase and in which situations HBase can be useful.


Apache HBase
Source: Apache
Understanding What is HBase
HBase is an open source, distributed, versioned, column-oriented, NoSQL / non-relational database management system that runs on top of Hadoop. It adds transactional capability to Hadoop, allowing users to update data records. Hadoop is designed for batch processing of large datasets, but with HBase on top of Hadoop we can work with data in real time.
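
To get a quick feel for that, here is a minimal HBase shell session (the table and column-family names are made up for illustration):

$ hbase shell
hbase> create 'users', 'info'                                  # table 'users' with one column family 'info'
hbase> put 'users', 'row1', 'info:email', 'user1@example.com'  # insert/update a single cell
hbase> get 'users', 'row1'                                     # read the row back
hbase> scan 'users'                                            # scan the whole table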

Optimize Map Reduce Job Performance

To improve Hadoop performance, you need to change various configuration parameters in core-site.xml, hdfs-site.xml, and mapred-site.xml. Which parameters to tune depends on the type of processing and varies from case to case; there is no hard and fast rule.

To install Hadoop on an Ubuntu cluster you can refer to this post.

We can change the block size, the number of mappers and reducers, the sort factor, JVM reuse, the memory given to Java processes, compression (including map output compression), the use of a combiner, and so on; a hedged example of a few of these settings follows below.
I found a very nice description from Cloudera.
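
As a hedged sketch only (MR1/CDH3-era property names; sensible values depend entirely on your data and jobs), a few of these knobs in mapred-site.xml might look like:

<property><name>io.sort.mb</name><value>200</value></property>                    <!-- map-side sort buffer size (MB) -->
<property><name>io.sort.factor</name><value>25</value></property>                 <!-- streams merged at once during sort -->
<property><name>mapred.compress.map.output</name><value>true</value></property>   <!-- compress intermediate map output -->
<property><name>mapred.job.reuse.jvm.num.tasks</name><value>-1</value></property> <!-- reuse task JVMs without limit -->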



Deploy Apache Flume NG (1.x.x)

In this tutorial I explain how to install, deploy, and configure Flume NG on a single system, how to configure Flume NG to copy data to HDFS, and then how to configure it to copy data to HBase.

Before getting to the configuration, let’s understand what Flume NG (1.x.x) is:
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic application.
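
To make that concrete, a minimal single-node Flume NG agent that listens on a netcat source and writes to an HDFS sink could be configured roughly like this (the agent, source, channel, and sink names, plus the HDFS path, are made up for illustration):

# flume.conf
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost:8020/flume/events
agent1.sinks.sink1.channel = ch1

$ flume-ng agent --conf conf --conf-file flume.conf --name agent1   # start the agent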





Flume-Solr Integration

Integrating Flume with Solr: I have created a new sink. This sink is usually used with the regexAll decorators, which perform a light transformation of event data into attributes. These attributes are converted into a Solr document and committed to Solr.
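
Conceptually, each document the sink builds reaches Solr the same way a manual post over Solr's HTTP/XML update API would; the field names below are made up for illustration:

$ curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml' \
      --data-binary '<add><doc><field name="id">event-001</field><field name="body">GET /index.html 200</field></doc></add>'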


What is Solr
Solr is an open source enterprise search server based on Lucene. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.


What is Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management.


I have used flume-0.9.3 and apache-solr-3.1.0 for this POC.


The RegexAllExtractor decorator prepares events that contain attributes ready to be written into Solr. Implementing a RegexAllExtractor decorator is very simple.


S3 instead of HDFS with Hadoop

In this article we will discuss using S3 as a replacement for HDFS (Hadoop Distributed File System) on AWS (Amazon Web Services), and why S3 is needed in the first place. Before getting to the actual use case and the performance of S3 with Hadoop, let’s understand what Hadoop is and what S3 is.

Let’s try to understand what the exact problems are and why HDFS is often not used in the cloud. When new instances are launched in the cloud to build a Hadoop cluster, they do not have any data on them, so one approach is to copy the entire huge dataset onto them; that is often not feasible for various reasons, including bandwidth, the time to copy, and the associated cost. Secondly, after the jobs complete you would need to copy the results back before terminating the cluster machines; otherwise the results are lost when the instances are terminated and you get nothing. Also, because of the associated cost, keeping the entire cluster running just for data collection is not feasible.
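
As a rough sketch of the alternative (bucket name and keys are placeholders), Hadoop can talk to S3 native storage directly once credentials are in core-site.xml, and results can be copied out with distcp before the cluster is terminated:

<!-- core-site.xml -->
<property><name>fs.s3n.awsAccessKeyId</name><value>YOUR_ACCESS_KEY</value></property>
<property><name>fs.s3n.awsSecretAccessKey</name><value>YOUR_SECRET_KEY</value></property>

$ hadoop fs -ls s3n://my-bucket/input/                       # read data directly from S3
$ hadoop distcp hdfs:///results s3n://my-bucket/results/     # persist results to S3 before terminating instances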

Deploy Hadoop Cluster

Step by Step Tutorial to Deploy Hadoop Cluster (fully distributed mode):
Setting up Hadoop as a cluster (fully distributed mode) requires multiple machines/nodes; one node acts as the master and the rest act as slaves.
If you want a quick introduction to Hadoop, please click here.
If you want to set up Hadoop in pseudo-distributed mode, please click here.

In this tutorial:
  • I am using 3 nodes: 1 master and 2 slaves
  • I am using Cloudera's Distribution for Apache Hadoop, CDH3u3 (you can also use Apache Hadoop 0.20.x)
  • I am deploying Hadoop on Ubuntu (you can use another OS such as CentOS, Red Hat, etc.)

Install / Setup Hadoop on cluster

Install Hadoop on master:

1. Add entries for the master and slaves in the hosts file:
Edit the hosts file and add the following entries:
$ sudo pico /etc/hosts
MASTER-IP    master
SLAVE01-IP   slave01
SLAVE02-IP   slave02
(In place of MASTER-IP, SLAVE01-IP, and SLAVE02-IP, put the corresponding IP addresses.)

S3 as Input or Output for Hadoop MR jobs


How to use S3 (S3 native) as input/output for a Hadoop MapReduce job. In this tutorial we will first try to understand what S3 is and the difference between s3 and s3n, and then how to set s3n as the input and output for a Hadoop MapReduce job. Configuring s3n for I/O can be useful for local MapReduce jobs (i.e., MR jobs run on a local cluster), but it is especially important when we run Elastic MapReduce jobs (i.e., when we run jobs in the cloud). When we run a job in the cloud we need to specify a storage location for the input as well as the output, one that is available for both storage and retrieval. In this tutorial we will learn how to specify S3 for input/output.

What is S3: Amazon s3 (Simple Storage Service) is a data storage service. Amazon s3 is storage for the Internet. It is designed to make web-scale computing easier for developers.
Amazon s3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers. You are billed monthly for storage and data transfer. Transfer between s3 and Amazon EC2 is free. This makes use of s3 attractive for Hadoop users who run clusters on EC2.
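
For example (the bucket names and example jar are placeholders), once fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey are set in core-site.xml, s3n paths can be passed directly as a job's input and output:

$ hadoop jar hadoop-examples.jar wordcount \
      s3n://my-bucket/input/ \
      s3n://my-bucket/output/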

Hue Features

This Hue tutorial describes the features of Cloudera Hue.

Cloudera Hue is a web-based UI for Hadoop. It started out as Cloudera Desktop and was later renamed Hue. Hue provides the following features:
  • Beeswax
  • File Browser
  • Job Designer
  • Job Browser
  • User Admin

Hue Installation and Configuration

This section describes how to install Cloudera Hue and change its default configuration, for example configuring a different database with Hue and sending a notification/email on job completion.


Installing Hue on one machine with CDH in pseudo-distributed mode:


To install Hue:
  • Hue is installed with this single command:

$ sudo apt-get install hue
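
After installation (on a CDH package install; paths and service names can differ on other setups), Hue's configuration lives in hue.ini and the service is started in the usual way:

$ sudo vi /etc/hue/hue.ini        # default configuration file; database settings go here
$ sudo service hue start          # the web UI then listens on the http_port set in hue.ini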

Running Cloudera in Distributed Mode

This section contains instructions for installing the Cloudera Distribution for Hadoop (CDH3) on Ubuntu. This is a CDH quickstart tutorial for setting up the Cloudera Distribution for Hadoop (CDH3) quickly on Debian-based systems. It is the shortest Cloudera installation tutorial; here you will get all the commands, with descriptions, required to install Cloudera in distributed mode (multi-node cluster).

Prerequisite: Before running Cloudera in distributed mode you must set up Cloudera in pseudo-distributed mode, and you need at least two machines, one for the master and another for a slave (you can create more than one virtual machine on a single machine to form a cluster).

Running Cloudera in Pseudo Distributed Mode

This section contains instructions for installing the Cloudera Distribution for Hadoop (CDH3) on Ubuntu. This is a CDH quickstart tutorial for setting up the Cloudera Distribution for Hadoop (CDH3) quickly on Debian-based systems. It is the shortest Cloudera installation tutorial; here you will get all the commands, with descriptions, required to install Cloudera in pseudo-distributed mode (single-node cluster).

Following steps tested on
Hadoop: CDH (Cloudera Distribution of Apache Hadoop)
OS: Ubuntu

Running Cloudera in Standalone Mode

This section contains instructions for installing the Cloudera Distribution for Hadoop (CDH3) on Ubuntu. This is a CDH quickstart tutorial for setting up the Cloudera Distribution for Hadoop (CDH3) quickly on Debian-based systems. It is the shortest Cloudera installation tutorial; here you will get all the commands, with descriptions, required to install Cloudera in standalone mode (single-node cluster).

Following steps tested on
Hadoop: CDH (Cloudera Distribution of Apache Hadoop)
OS: Ubuntu

Understanding What is Hadoop


What is Hadoop:
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware; it incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets (in the range of terabytes to zettabytes).