Friday, February 21, 2014

Getting MapReduce 2 Up to Speed

Apache Hadoop is no exception to this rule. Recently, Cloudera engineers set out to ensure that MapReduce performance in Hadoop 2 (MR2/YARN) is on par with, or better than, MapReduce performance in Hadoop 1 (MR1). Architecturally, MR2 has many performance advantages over MR1:

Better scalability by splitting the JobTracker into the ResourceManager and Application Masters.
Better cluster utilization and higher throughput through finer-grained resource scheduling.
Less tuning required to avoid over-spilling, thanks to smarter sort buffer management.
Faster completion times for small jobs through “Uber Application Masters,” which run all of a job’s tasks in a single JVM.
While these improvements are important, none of them mean particularly much for well-tuned medium-sized jobs on medium-sized clusters. Whenever a codebase goes through large changes, regressions are likely to seep in.

While correctness issues are easy to spot, performance regressions are difficult to catch without rigorous measurement. When we started including MR2 in our performance measurements last year, we noticed that it lagged behind MR1 significantly on nearly every benchmark. Since then, we’ve done a ton of work (tuning parameters in Cloudera Manager and fixing regressions in MapReduce itself) and can now proudly say that CDH 5 MR2 performs as well as, or better than, MR1 on all our benchmarks.

In this post, I’ll offer a couple of examples of this work as case studies in tracking down the performance regressions of complex (Java) distributed systems.

Ensuring a Fair Comparison

Ensuring a fair comparison between MR1 and MR2 is tricky. One common pitfall is that TeraSort, the job most commonly used for benchmarking, changed between MR1 and MR2. To reflect rule changes in the GraySort benchmark on which it is based, the data generated by the TeraSort included with MR2 is less compressible. A valid comparison would use the same version of TeraSort for both releases; otherwise, MR1 will have an unfair advantage.

Another difficult area is resource configuration. In MR1, each node’s resources must be split between slots available for map tasks and slots available for reduce tasks. In MR2, the resource capacity configured for each node is available to both map and reduce tasks. So, if you give MR1 nodes 8 map slots and 8 reduce slots and give MR2 nodes 16 slots worth of memory, resources will be underutilized during MR1’s map phase. MR2 will be able to run 16 concurrent mappers per node while MR1 will only be able to run 8. If you only give MR2 nodes 8 slots of memory, then MR2 will suffer during the period when the map and reduce phases overlap – it will only get to run 8 tasks concurrently, while MR1 will be able to run more. (See this post for more information about properly configuring MR2.)
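
The slot arithmetic in the paragraph above can be sketched in a few lines of shell. This is only an illustration: the function name compare_concurrency is hypothetical, and it assumes that every task needs exactly one slot's worth of memory.

```shell
# Hypothetical helper: per-node task concurrency for MR1 vs. MR2.
# Assumes each task occupies exactly one slot's worth of memory.
#   $1 = MR1 map slots, $2 = MR1 reduce slots, $3 = MR2 capacity in slots
compare_concurrency() {
  local map_slots=$1 reduce_slots=$2 mr2_slots=$3
  # Map-only phase: MR1 can use only its map slots; MR2 can use everything.
  echo "map-only: MR1=${map_slots} MR2=${mr2_slots}"
  # Overlap phase: MR1 can fill both pools; MR2 is capped by its shared pool.
  echo "overlap: MR1=$((map_slots + reduce_slots)) MR2=${mr2_slots}"
}

compare_concurrency 8 8 16   # MR2 given 16 slots of memory
compare_concurrency 8 8 8    # MR2 given 8 slots of memory
```

With 16 slots of memory, MR2 runs twice as many mappers as MR1 during the map-only phase; with 8, MR1 can run twice as many tasks as MR2 while the phases overlap.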

To circumvent this issue, our benchmarks give full node capacity in MR1 to both map slots and reduce slots. We then set the mapred.reduce.slowstart.completedmaps parameter in both to .99, meaning that there will be no overlap between the map and reduce phases. This ensures that MR1 and MR2 get full cluster resources for both phases.
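
As a rough sketch of how such a setting might be passed on the command line, the helper below prints a -D option for the slow-start fraction. The function name slowstart_opt is hypothetical. The post uses mapred.reduce.slowstart.completedmaps for both frameworks; choosing the newer Hadoop 2 property name, mapreduce.job.reduce.slowstart.completedmaps, for MR2 is this sketch's own choice, not the post's.

```shell
# Hypothetical helper: print the -D option that sets the reduce slow-start
# fraction (default 0.99) for the given framework.
#   $1 = mr1 | mr2, $2 = fraction of maps that must finish before reducers start
slowstart_opt() {
  local framework=$1 fraction=${2:-0.99}
  case "$framework" in
    mr1) echo "-Dmapred.reduce.slowstart.completedmaps=${fraction}" ;;
    mr2) echo "-Dmapreduce.job.reduce.slowstart.completedmaps=${fraction}" ;;
    *)   echo "unknown framework: ${framework}" >&2; return 1 ;;
  esac
}

slowstart_opt mr1
slowstart_opt mr2

# Possible use with a job that accepts generic options (the jar name and the
# paths are placeholders, not taken from the post):
#   hadoop jar EXAMPLES-JAR terasort $(slowstart_opt mr2) INPUT-DIR OUTPUT-DIR
```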


Facebook unveils Presto engine for querying 250 PB data warehouse

As Facebook’s user base swells larger and larger, data about users is growing much faster, and the social networking giant has developed a faster way to analyze it all with its Presto query engine.

At a conference for developers at Facebook headquarters on Thursday, engineers working for the social networking giant revealed that it’s using a new homemade query engine called Presto to do fast interactive analysis on its already enormous 250-petabyte-and-growing data warehouse.

More than 850 Facebook employees use Presto every day, scanning 320 TB each day, engineer Martin Traverso said.

“Historically, our data scientists and analysts have relied on Hive for data analysis,” Traverso said. “The problem with Hive is it’s designed for batch processing. We have other tools that are faster than Hive, but they’re either too limited in functionality or too simple to operate against our huge data warehouse. Over the past few months, we’ve been working on Presto to basically fill this gap.”

Facebook created Hive several years ago to give Hadoop some data warehouse and SQL-like capabilities, but it is showing its age in terms of speed because it relies on MapReduce. Scanning over an entire dataset could take many minutes to hours, which isn’t ideal if you’re trying to ask and answer questions in a hurry.


Thursday, February 20, 2014

Hadoop Troubleshooting-002

How to resolve java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell error.

bash-3.00$ hadoop dfs -ls
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell

Run the following command and check whether the jar hadoop-common-2.0.2-alpha-gphd- is in the classpath.
$ hadoop classpath

If the jar is not in the classpath, add the following entry:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:YOUR-PATH/hadoop-2.0.2-alpha-gphd-
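
The classpath check can be scripted roughly as follows. The function name classpath_has and the demo classpath are made up for illustration. Note that `hadoop classpath` may print wildcard entries such as a directory followed by `*`, which stand for every jar in that directory, so a miss from this check is not conclusive on its own.

```shell
# Hypothetical helper: succeed if any classpath entry contains the substring.
#   $1 = colon-separated classpath, $2 = substring to look for
classpath_has() {
  local classpath=$1 needle=$2
  # Split on ':' so each entry is on its own line, then match literally.
  printf '%s\n' "$classpath" | tr ':' '\n' | grep -qF -- "$needle"
}

# Self-contained demo with a made-up classpath string:
demo_cp="/opt/hadoop/etc/hadoop:/opt/hadoop/share/common/hadoop-common-X.jar"
if classpath_has "$demo_cp" "hadoop-common"; then
  echo "found"
else
  echo "missing"
fi

# Against a real installation (assumes `hadoop` is on the PATH):
#   classpath_has "$(hadoop classpath)" "hadoop-common" || echo "missing"
```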

Hadoop Troubleshooting-001

Hadoop reports the following error (java.lang.UnsupportedClassVersionError) when running a Hadoop / HDFS command

bash-3.00$ hadoop  dfs -ls /
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(
        at Method)
        at java.lang.ClassLoader.loadClass(
        at sun.misc.Launcher$AppClassLoader.loadClass(
        at java.lang.ClassLoader.loadClass(
        at java.lang.ClassLoader.loadClassInternal(

This is a typical Java error, caused by running a class that was compiled with a newer Java version than the one currently installed on your system. If you are running Hadoop with an older Java version (older than 1.6), you will get this error. Check the Java version (java -version) and point Hadoop to the correct Java version.
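
A version check along these lines can be sketched in shell. The function name java_major is hypothetical, and the version strings in the demo are examples only; the threshold of 6 follows the "older than 1.6" remark above.

```shell
# Hypothetical helper: extract the major Java version from a version string.
# Handles the legacy "1.N.x" scheme (1.5.0_22 -> 5) and the newer "N.x" one.
java_major() {
  local v=$1
  case "$v" in
    1.*) v=${v#1.}; echo "${v%%.*}" ;;   # 1.5.0_22 -> 5
    *)   echo "${v%%[.+-]*}" ;;          # 11.0.2   -> 11
  esac
}

required=6
for v in "1.5.0_22" "1.7.0_45"; do
  if [ "$(java_major "$v")" -lt "$required" ]; then
    echo "$v: too old"
  else
    echo "$v: ok"
  fi
done

# To read the installed version (java -version writes to stderr):
#   java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p'
```

If the installed Java turns out to be too old, one common way to point Hadoop at a newer one is the JAVA_HOME setting in hadoop-env.sh; the exact location of that file depends on the distribution.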

Big Data Use-cases Across Industries, It's Becoming Bigger and Bigger

Big Data will keep getting bigger: according to certain market forecasts, it will grow by 1211%.
Big Data is growing very fast as the volume of data explodes. According to Gartner, we create 2.5 quintillion bytes of data per day (90% of the data in the world has been created in the last two years alone).