Facebook unveils Presto engine for querying 250 PB data warehouse

As Facebook’s user base keeps swelling, the data about those users is growing even faster, and the social networking giant has developed a quicker way to analyze it all with its Presto query engine.

At a conference for developers at Facebook headquarters on Thursday, engineers revealed that the company is using a new home-grown query engine called Presto to do fast interactive analysis on its already enormous, 250-petabyte-and-growing data warehouse.

More than 850 Facebook employees use Presto, scanning a combined 320 TB each day, engineer Martin Traverso said.

“Historically, our data scientists and analysts have relied on Hive for data analysis,” Traverso said. “The problem with Hive is it’s designed for batch processing. We have other tools that are faster than Hive, but they’re either too limited in functionality or too simple to operate against our huge data warehouse. Over the past few months, we’ve been working on Presto to basically fill this gap.”

Facebook created Hive several years ago to give Hadoop some data warehouse and SQL-like capabilities, but it is showing its age in terms of speed because it relies on MapReduce. Scanning an entire dataset can take anywhere from several minutes to hours, which isn’t ideal if you’re trying to ask and answer questions in a hurry.
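To illustrate the difference in workflow (this is a sketch, not taken from Facebook’s announcement): the same SQL can be submitted to Hive as a batch job or to Presto interactively through its command-line client. The coordinator address, catalog, and table name below are assumptions made for the example.

# Batch-style query through Hive; this launches MapReduce jobs behind the scenes
hive -e "SELECT dt, COUNT(*) FROM page_views GROUP BY dt"

# The same query through the Presto CLI runs interactively against the
# Hive warehouse, without spinning up MapReduce jobs
presto --server presto-coordinator:8080 --catalog hive --schema default \
       --execute "SELECT dt, COUNT(*) FROM page_views GROUP BY dt"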


Hadoop Troubleshooting-002

How to resolve the java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell error.


bash-3.00$ hadoop dfs -ls
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell

Solution:
Run the following command and check whether the jar hadoop-common-2.0.2-alpha-gphd-2.0.1.0.jar is in the classpath.
$hadoop classpath
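The classpath output can be quite long, so filtering it makes the check easier. This is just one way to do it; note that the entry may show up either as the explicit jar or as a wildcard directory such as share/hadoop/common/*.

# Print each classpath entry on its own line and look for hadoop-common
hadoop classpath | tr ':' '\n' | grep hadoop-common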


If the jar is not in the classpath, add the following entry to hadoop-env.sh:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:YOUR-PATH/hadoop-2.0.2-alpha-gphd-2.0.1.0/share/hadoop/common/hadoop-common-2.0.2-alpha-gphd-2.0.1.0.jar
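After adding the entry, re-run the classpath check and the command that originally failed to confirm the error is gone. A short sketch (YOUR-PATH is the same placeholder as above):

# Confirm the jar now appears in the classpath
hadoop classpath | tr ':' '\n' | grep hadoop-common

# Re-run the listing; per the deprecation notice, use hdfs dfs instead of hadoop dfs
hdfs dfs -ls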


Hadoop Troubleshooting-001

Hadoop throws the following error (java.lang.UnsupportedClassVersionError) when running a Hadoop / HDFS command.

bash-3.00$ hadoop  dfs -ls /
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)

Solution:
This is a typical Java error, which occurs when you run a class compiled with a newer Java version than the one currently installed on your system. If you are running Hadoop with an older Java version (older than 1.6), you will get this error. Check the Java version (java -version) and point Hadoop to the correct Java version.
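A minimal sketch of those two steps; the JDK path below is an assumption and should be replaced with the actual location of a Java 6 (or later) installation on your system.

# Check which Java version is currently on the PATH
java -version

# Point Hadoop at a suitable JDK by setting JAVA_HOME in hadoop-env.sh
# (the path is an example; use your real JDK location)
export JAVA_HOME=/usr/java/jdk1.6.0_45

# Then re-run the command that failed
hdfs dfs -ls /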