
Getting MapReduce 2 Up to Speed

Performance has to be re-validated whenever a codebase evolves, and Apache Hadoop is no exception. Recently, Cloudera engineers set out to ensure that MapReduce performance in Hadoop 2 (MR2/YARN) is on par with, or better than, MapReduce performance in Hadoop 1 (MR1). Architecturally, MR2 has many performance advantages over MR1:

  • Better scalability by splitting the JobTracker into the ResourceManager and Application Masters.
  • Better cluster utilization and higher throughput through finer-grained resource scheduling.
  • Less tuning required to avoid over-spilling, thanks to smarter sort buffer management.
  • Faster completion times for small jobs through “Uber Application Masters,” which run all of a job’s tasks in a single JVM.
While these improvements are important, none of them matters much for well-tuned, medium-sized jobs on medium-sized clusters. And whenever a codebase goes through large changes, regressions are likely to seep in.

While correctness issues are easy to spot, performance regressions are difficult to catch without rigorous measurement. When we started including MR2 in our performance measurements last year, we noticed that it lagged behind MR1 significantly on nearly every benchmark. Since then, we’ve done a ton of work — tuning parameters in Cloudera Manager and fixing regressions in MapReduce itself — and can now proudly say that CDH 5 MR2 performs as well as, or better than, MR1 on all our benchmarks.

In this post, I’ll offer a couple of examples of this work as case studies in tracking down performance regressions in complex (Java) distributed systems.

Ensuring a Fair Comparison

Ensuring a fair comparison between MR1 and MR2 is tricky. One common pitfall is that TeraSort, the job most commonly used for benchmarking, changed between MR1 and MR2. To reflect rule changes in the GraySort benchmark on which it is based, the data generated by the TeraSort included with MR2 is less compressible. A valid comparison would use the same version of TeraSort for both releases; otherwise, MR1 will have an unfair advantage.

Another difficult area is resource configuration. In MR1, each node’s resources must be split between slots available for map tasks and slots available for reduce tasks. In MR2, the resource capacity configured for each node is available to both map and reduce tasks. So, if you give MR1 nodes 8 map slots and 8 reduce slots and give MR2 nodes 16 slots worth of memory, resources will be underutilized during MR1’s map phase. MR2 will be able to run 16 concurrent mappers per node while MR1 will only be able to run 8. If you only give MR2 nodes 8 slots of memory, then MR2 will suffer during the period when the map and reduce phases overlap – it will only get to run 8 tasks concurrently, while MR1 will be able to run more. (See this post for more information about properly configuring MR2.)

To circumvent this issue, our benchmarks give full node capacity in MR1 to both map slots and reduce slots. We then set the mapred.reduce.slowstart.completedmaps parameter in both to .99, meaning that there will be no overlap between the map and reduce phases. This ensures that MR1 and MR2 get full cluster resources for both phases.
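As a rough sketch of what a benchmark run with that setting looks like in practice (the examples jar name, HDFS paths, and data size below are placeholders, and the property shown is the MR2-era name; on MR1 the corresponding mapred.* property is set as described above):

$ # Generate sample input (10 million 100-byte rows; the size is illustrative)
$ hadoop jar hadoop-mapreduce-examples.jar teragen 10000000 /bench/tera-in

$ # Run TeraSort with reducers held back until ~99% of maps have completed,
$ # so the map and reduce phases do not overlap
$ hadoop jar hadoop-mapreduce-examples.jar terasort \
      -Dmapreduce.job.reduce.slowstart.completedmaps=0.99 \
      /bench/tera-in /bench/tera-out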


Facebook unveils Presto engine for querying 250 PB data warehouse

As Facebook’s user base swells larger and larger, data about users is growing much faster, and the social networking giant has developed a faster way to analyze it all with its Presto query engine.

At a conference for developers at Facebook headquarters on Thursday, engineers working for the social networking giant revealed that it’s using a new homemade query engine called Presto to do fast interactive analysis on its already enormous 250-petabyte-and-growing data warehouse.

More than 850 Facebook employees use Presto, scanning 320 TB each day, engineer Martin Traverso said.

“Historically, our data scientists and analysts have relied on Hive for data analysis,” Traverso said. “The problem with Hive is it’s designed for batch processing. We have other tools that are faster than Hive, but they’re either too limited in functionality or too simple to operate against our huge data warehouse. Over the past few months, we’ve been working on Presto to basically fill this gap.”

Facebook created Hive several years ago to give Hadoop some data warehouse and SQL-like capabilities, but it is showing its age in terms of speed because it relies on MapReduce. Scanning over an entire dataset could take many minutes to hours, which isn’t ideal if you’re trying to ask and answer questions in a hurry.


Hadoop Troubleshooting-001

Hadoop throws the following error (java.lang.UnsupportedClassVersionError) when running a Hadoop / HDFS command:

bash-3.00$ hadoop  dfs -ls /
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)

Solution:
This is a typical Java error, caused when you run a class that was compiled with a newer Java version than the one currently installed on your system. If you are running Hadoop on an old Java version (older than 1.6), you will get this error. Check your Java version (java -version) and point Hadoop to the correct Java version.
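For example, a quick way to check and fix this on the command line (the JDK path below is only an illustration; use the location of your own Java 1.6+ installation):

$ # Confirm which Java version the shell picks up
$ java -version

$ # Point Hadoop at a suitable JDK by setting JAVA_HOME in hadoop-env.sh
$ echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> $HADOOP_HOME/conf/hadoop-env.sh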

Solr vs Elasticsearch

In this tutorial we will look at the differences between Apache Solr and Elasticsearch. Before going to the actual post, let's understand what Apache Solr and Elasticsearch are.

Apache Solr Introduction:
Solr is an open source enterprise search server based on Lucene. Developing a high-performance, feature-rich application that uses Lucene directly is difficult, and it is limited to Java applications. Solr solves this by exposing the wealth of power in Lucene via configuration files and HTTP parameters, while adding some features of its own. The most notable configuration file is the index's schema, which defines the fields and the configuration of their text analysis.

Elastic Search Introduction: ElasticSearch is a distributed, RESTful, free/open source search server based on Apache Lucene. It was developed by Shay Banon and is released under the terms of the Apache License. ElasticSearch is developed in Java.
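To make the RESTful / HTTP-parameter point concrete, here is a minimal sketch of querying each server over HTTP, assuming a local Solr core named collection1 on its default port 8983 and a local Elasticsearch index named myindex on its default port 9200 (the core and index names are hypothetical):

$ # Query Solr via HTTP request parameters
$ curl 'http://localhost:8983/solr/collection1/select?q=hadoop&wt=json'

$ # Query Elasticsearch via its REST API
$ curl 'http://localhost:9200/myindex/_search?q=hadoop'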

To read the complete post, please refer to http://solr-vs-elasticsearch.com/

Deploy Hadoop Cluster

Step by Step Tutorial to Deploy a Hadoop Cluster (fully distributed mode):
Setting up Hadoop as a cluster (fully distributed) requires multiple machines/nodes; one node acts as the master and the rest act as slaves.
If you want a quick introduction to Hadoop, please click here.
If you want to set up Hadoop in pseudo-distributed mode, please click here.

In this tutorial:
  • I am using 3 nodes: 1 master and 2 slaves
  • I am using Cloudera's Distribution for Apache Hadoop, CDH3u3 (you can also use Apache Hadoop 0.20.x)
  • I am deploying Hadoop on Ubuntu (you can use another OS such as CentOS, Red Hat, etc.)

Install / Setup Hadoop on cluster

Install Hadoop on master:

1. Add entries for the master and slaves in the hosts file:
Edit the hosts file and add the following entries:
$ sudo pico /etc/hosts
MASTER-IP    master
SLAVE01-IP   slave01
SLAVE02-IP   slave02
(In place of MASTER-IP, SLAVE01-IP, SLAVE02-IP put the value of corresponding IP)

Understanding Greenplum

This blog will guide you through understanding Greenplum: what Greenplum is, its architecture, its different segment types, and its basics in detail. In this Greenplum tutorial we will try to understand the capabilities and the architecture provided by Greenplum.
 
What is Greenplum: Greenplum Database is a massively parallel processing (MPP) database server based on PostgreSQL open-source technology. MPP (also known as a shared nothing architecture) refers to systems with two or more processors which cooperate to carry out an operation - each processor with its own memory, operating system and disks.
Source: Greenplum
 
Greenplum Database Architecture: Greenplum Database utilizes a shared-nothing MPP (massively parallel processing) architecture. In this architecture, data is automatically partitioned across multiple 'segment' servers, and each 'segment' owns and manages a distinct portion of the overall data. All communication is via a network interconnect; there is no disk-level sharing or contention to be concerned with (i.e. it is a 'shared-nothing' architecture). The segment servers are able to process every query in a fully parallel manner, use all disk connections simultaneously, and efficiently flow data between segments as query plans dictate.
Source: Greenplum
 
 
Is Greenplum free: Greenplum is not free to set up for production or to set up as a cluster, but EMC has launched a Community Edition, which is free but comes with pre-specified guidelines.
 
What is the Community Edition: The EMC Greenplum Community Edition (CE) provides a powerful and comprehensive analytic environment enabling users to turn increasingly large amounts of data into useful insight. Developers, data scientists and other data professionals can experiment with real-world data, perform advanced analytics and, most importantly, rapidly reveal insights from big data sets with ease.
 
Parallel Query Optimizer: "Greenplum Database's parallel query optimizer is responsible for converting SQL or MapReduce into a physical execution plan." It does this using a cost-based optimization algorithm in which it evaluates a vast number of potential plans and selects the one that it believes will lead to the most efficient query execution.
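As a small illustration, you can ask Greenplum for the plan it builds through the standard psql client (the database mydb and the sales table here are hypothetical):

$ # Show the parallel execution plan chosen by the optimizer
$ psql -d mydb -c "EXPLAIN SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id;"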
 
Parallel Dataflow Engine: At the heart of the Greenplum Database is the Parallel Dataflow Engine. This is where the real work of processing and analyzing data is done. Greenplum's Parallel Dataflow Engine is highly optimized for executing both SQL and MapReduce, and does so in a massively parallel manner.
 Source: Greenplum
 
 
Greenplum Database support (hardware and OS requirements): Greenplum Database is supported for production use on SUSE Linux Enterprise Server 10.2 (64-bit), Red Hat Enterprise Linux 5.x (64-bit), CentOS Linux 5.x (64-bit) and Sun Solaris 10U5+ (64-bit). Greenplum Database 3.3 is supported on server hardware from a range of vendors including HP, Dell, Sun and IBM. Greenplum Database is supported for non-production (development and evaluation) use on Mac OS X 10.5, Red Hat Enterprise Linux 5.2 or higher (32-bit) and CentOS Linux 5.2 or higher (32-bit).
 
Greenplum Master Segment: "The master is the entry point to the Greenplum Database system. It is the database process that accepts client connections and processes the SQL commands issued by the users of the system." The master is where the global system catalog resides (the set of system tables that contain metadata about the Greenplum Database system itself); however, the master does not contain any user data. Data resides only on the segments. The master does the work of authenticating client connections, processing the incoming SQL commands, distributing the workload among the segments, coordinating the results returned by each of the segments, and presenting the final results to the client program.
 
Greenplum Primary Segments: In Greenplum Database, the segments are where the data is stored and where the majority of query processing takes place. User-defined tables and their indexes are distributed across the available number of segments in the Greenplum Database system, each segment containing a distinct portion of the data. Segment instances are the database server processes that serve segments. Users do not interact directly with the segments in a Greenplum Database system, but do so through the master.
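As a minimal sketch of how a table's rows end up spread across segments, Greenplum lets you declare the distribution key when the table is created (mydb, the sales table and its columns are hypothetical):

$ # Rows are hashed on customer_id, and each row is stored on exactly one primary segment
$ psql -d mydb -c "CREATE TABLE sales (sale_id bigint, customer_id int, amount numeric) DISTRIBUTED BY (customer_id);"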
 
Mirror Segment: Mirror segments allow database queries to fail over to a backup segment if the primary segment becomes unavailable. A mirror (secondary) segment always resides on a different host than its primary.
 
Source: Greenplum
 
 

Create an AMI


This blog will guide you through creating an Ubuntu AMI (Amazon Machine Image) from a launched instance. In this tutorial we will create an S3-backed AMI from a running (Ubuntu) instance. Before getting down to creating the actual AMI, let's try to understand some basic terminology:

Understand what AMI is: An Amazon Machine Image (AMI) is a special type of virtual appliance which is used to instantiate (create) a virtual machine within the Amazon Elastic Compute Cloud. It serves as the basic unit of deployment for services delivered using EC2. We can say that AMI is an image from which an instance can boot.

What is Amazon EC2: Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
Why create your own AMI: Create your own AMI so that you can boot new custom instances that have all the required software preinstalled. Your AMI becomes your basic unit of deployment; it will save you the time of installing the required software again and again.
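For context, the classic S3-backed ("instance-store") workflow uses the EC2 AMI tools, roughly as sketched below; the key/certificate paths, account ID, and bucket name are placeholders, and the exact options may differ for your setup:

$ # On the running instance: bundle the root volume into an image
$ ec2-bundle-vol -k /tmp/pk.pem -c /tmp/cert.pem -u <your-aws-account-id> -d /mnt

$ # Upload the bundle to an S3 bucket
$ ec2-upload-bundle -b my-ami-bucket -m /mnt/image.manifest.xml -a <access-key> -s <secret-key>

$ # Register the uploaded bundle as a new AMI
$ ec2-register my-ami-bucket/image.manifest.xml -n my-custom-ubuntu-ami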

Understanding What Hadoop Is


What is Hadoop:
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware; it incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system that, like Hadoop as a whole, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets (in the range of terabytes to zettabytes).
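As a tiny sketch of interacting with HDFS once a cluster is up (the local file name and HDFS path are just examples):

$ # Copy a local file into HDFS
$ hadoop fs -put mydata.txt /user/hadoop/mydata.txt

$ # List it back to verify
$ hadoop fs -ls /user/hadoop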