Let us talk about the Hadoop ecosystem and its various components. After reading this article, you will know what the Hadoop ecosystem is and which components make it up, along with their features.
The Hadoop ecosystem comprises components such as HDFS, YARN, MapReduce, HBase, Hive, Pig, Zookeeper, Flume, Sqoop, Oozie, and more.
The Hadoop ecosystem is a platform, or framework, comprising a suite of components and services that solve the problems which arise while dealing with big data. It consists of Apache open-source projects and various commercial tools. The ecosystem encompasses services such as ingesting, storing, analyzing, and maintaining data.
Now let us understand each Hadoop ecosystem component in detail:
Components of Hadoop Ecosystem
1. HDFS
Hadoop is known for its distributed storage (HDFS). The Hadoop Distributed File System is a core component of the Hadoop ecosystem and serves as the backbone of the Hadoop framework. HDFS enables Hadoop to store huge amounts of data from heterogeneous sources, in any format: structured, unstructured, or semi-structured. It is a Java-based distributed file system that provides distributed, fault-tolerant, reliable, cost-effective, and scalable storage.
HDFS consists of two daemons, that is, NameNode and DataNode.
a. NameNode: NameNode is the master node in the HDFS architecture. It keeps metadata about the data blocks, such as locations and permissions, but does not store the actual data. It manages and monitors the DataNodes.
b. DataNode: There are multiple DataNodes in the Hadoop cluster, and the actual data is stored in them. They are inexpensive commodity machines that store the data blocks and serve read and write requests.
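As a rough illustration of this division of labor (a toy sketch, not the real HDFS API; all class and file names here are hypothetical), a NameNode tracks which DataNodes hold each block while the DataNodes hold the bytes:

```python
# Toy sketch of HDFS-style storage: the NameNode keeps only metadata
# (which DataNodes hold each block); DataNodes hold the actual bytes.
# Hypothetical names and tiny sizes, not the real HDFS API.
BLOCK_SIZE = 8       # real HDFS default is 128 MB; tiny here for the demo
REPLICATION = 2      # real HDFS default replication factor is 3

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}                 # block_id -> bytes

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}               # filename -> [(block_id, [holder names])]

    def write(self, filename, data):
        """Split data into blocks and replicate each block across DataNodes."""
        chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        self.metadata[filename] = []
        for i, chunk in enumerate(chunks):
            block_id = f"{filename}_blk{i}"
            # round-robin placement of REPLICATION copies
            targets = [self.datanodes[(i + r) % len(self.datanodes)]
                       for r in range(REPLICATION)]
            for dn in targets:
                dn.blocks[block_id] = chunk
            self.metadata[filename].append((block_id, [dn.name for dn in targets]))

    def read(self, filename):
        """Reassemble the file by fetching each block from any replica holder."""
        out = b""
        for block_id, holders in self.metadata[filename]:
            dn = next(d for d in self.datanodes if d.name == holders[0])
            out += dn.blocks[block_id]
        return out

nodes = [DataNode(f"dn{i}") for i in range(3)]
nn = NameNode(nodes)
nn.write("report.txt", b"heterogeneous data of any format")
```

Because every block lives on more than one DataNode, losing a single machine does not lose data, which is the basis of HDFS's fault tolerance.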
2. YARN
Yet Another Resource Negotiator (YARN) manages resources and schedules jobs in the Hadoop cluster. It was introduced in Hadoop 2.0.
It is designed to split the functionality of job scheduling and resource management into separate daemons. YARN sits between HDFS and MapReduce, and consists of a ResourceManager, NodeManagers, and a per-application ApplicationMaster.
The ResourceManager is the central master daemon responsible for managing all processing requests, and it interacts with the NodeManagers. There are multiple NodeManagers: each slave node runs its own NodeManager for executing tasks. The per-application ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the tasks.
Features of YARN
- Better resource management.
- Scalability.
- Dynamic allocation of cluster resources.
3. MapReduce
MapReduce is the heart of the Hadoop framework and the core ecosystem component for processing data: it provides the logic of processing. In simple words, MapReduce is a programming model for writing applications that process huge amounts of data using distributed and parallel algorithms inside a Hadoop environment.
A MapReduce program consists of two functions, Map() and Reduce(). The Map function performs filtering, grouping, and sorting, while the Reduce function aggregates and summarizes the results produced by the Map function. The input and output of both functions are key-value pairs, and the output of the Map function is the input to the Reduce function.
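The model can be sketched with plain Python functions (no Hadoop involved): Map emits key-value pairs, a shuffle step groups them by key, and Reduce aggregates each group. The classic example is word count:

```python
from collections import defaultdict

# Toy word count in the MapReduce style: Map emits (word, 1) pairs,
# a shuffle step groups the pairs by key, Reduce sums each group.

def map_fn(line):
    """Map: emit a (key, value) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return (key, sum(values))

def run_mapreduce(lines):
    # Shuffle/sort: group map output by key. In Hadoop the framework
    # does this between the map and reduce phases, across the cluster.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = run_mapreduce(["big data big cluster", "big data"])
```

In a real Hadoop job the map calls run in parallel on the nodes holding the input blocks, which is what makes the model scale to petabytes.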
Features of MapReduce
- Simplicity – MapReduce jobs are easy to run. We can write MapReduce applications in languages such as C++, Java, and Python.
- Scalability – Hadoop MapReduce can process petabytes of data.
- Speed – MapReduce processes data in a distributed manner, so processing takes less time.
- Fault Tolerance – If one copy of the data is unavailable, another machine holds a replica of the same data that can be used to process the same subtask.
4. Apache Spark
Apache Spark was developed by the Apache Software Foundation to perform both batch and real-time processing at higher speed. It was built to meet the growing demand for processing real-time data that MapReduce jobs could not handle. With its in-memory processing capabilities, it increases processing speed and optimization.
Apache Spark can easily handle tasks like batch processing, iterative or interactive real-time processing, graph conversions, and visualization.
Features of Apache Spark
- Speed: Spark can be up to 100x faster than Hadoop MapReduce for large-scale data processing due to its in-memory computing and optimization.
- Ease of Use: It contains many easy to use APIs for operating on large datasets.
- Generality: It is a unified engine that comes packaged with higher-level libraries, that include support for SQL querying, machine learning, streaming data, and graph processing.
- Runs Everywhere: Apache Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
5. Apache Pig
Apache Pig is an abstraction over Hadoop MapReduce. Pig is a tool used for analyzing large data sets and is generally used with Apache Hadoop. Pig enables us to perform all the data manipulation operations in Hadoop, and it provides Pig Latin, a high-level language for writing data analysis programs.
Pig Latin provides various operators that programmers can use to develop their own functions for processing, reading, and writing data. To analyze data with Pig, programmers write scripts in Pig Latin. Internally, these scripts are converted into MapReduce tasks: the Pig Engine is the component of Apache Pig that accepts Pig Latin scripts as input and converts them into Hadoop MapReduce jobs.
Apache Pig enables programmers to perform complex MapReduce tasks without writing complex MapReduce code in Java.
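To show the kind of pipeline a Pig Latin script expresses (LOAD, FILTER, GROUP, FOREACH … GENERATE), here is a rough plain-Python equivalent; the records and field names are made up for illustration, and each commented step corresponds to one Pig Latin operator:

```python
# A Pig-style pipeline in plain Python: load -> filter -> group -> aggregate.
# In Pig Latin each step below would be one relational operator, and Pig's
# engine would compile the script into MapReduce jobs behind the scenes.
records = [  # hypothetical (user, action, bytes) log records
    ("alice", "download", 120),
    ("bob", "download", 300),
    ("alice", "upload", 50),
    ("alice", "download", 80),
]

# FILTER logs BY action == 'download';
downloads = [r for r in records if r[1] == "download"]

# GROUP downloads BY user;
groups = {}
for user, _action, nbytes in downloads:
    groups.setdefault(user, []).append(nbytes)

# FOREACH groups GENERATE user, SUM(bytes);
totals = {user: sum(values) for user, values in groups.items()}
```

A four-line Pig Latin script would express the same pipeline declaratively, with Pig choosing how to parallelize each step.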
Features of Apache Pig
- Rich set of operators: It offers programmers a rich set of operators for performing operations like sort, join, and filter.
- Ease of programming: Pig Latin is very similar to SQL, so developers familiar with SQL find it easy to write a Pig script.
- Optimization opportunities: Tasks in Pig automatically optimize their execution, so programmers only need to focus on the language's semantics.
- UDFs: Pig lets programmers create User-Defined Functions in other programming languages and invoke them in Pig scripts.
- Handles all kinds of data: We can analyze data of any format using Apache Pig, and Pig stores its results in HDFS.
6. Apache Hive
Hive was developed by Facebook to reduce the work of writing MapReduce programs. Apache Hive is an open-source data warehouse system used for distributed processing and data analysis.
It uses the Hive Query Language (HQL), a declarative language similar to SQL.
Apache Hive translates Hive queries into MapReduce programs. Hive lets developers process and analyze huge volumes of data by replacing complex Java MapReduce programs with Hive queries, and anyone familiar with SQL commands can easily write them.
Hive serves three functions: data summarization, query, and analysis. It is mainly used for data analytics.
Some major components of Hive are:
a. Hive client: Apache Hive supports applications written in languages such as Java, Python, and Ruby.
Beeline shell: the command-line shell from which users can submit their queries to the system.
b. HiveServer2: It enables clients to execute their queries against Hive.
c. Hive compiler: It parses the Hive query and performs type checking and semantic analysis on the different query blocks.
d. Metastore: The central repository that stores metadata.
Features of Hive
- Supports all primitive SQL data types
- Supports user-defined functions
- Provides a tool for ETL operations and adds SQL-like capabilities to the Hadoop environment
7. Apache HBase
HBase is an open-source distributed NoSQL database that stores sparse data in tables consisting of billions of rows and columns. It is modeled after Google's Bigtable and written in Java. HBase supports all kinds of data and is built on top of Hadoop. We use HBase when we have to search for or retrieve a small amount of data from large volumes of data.
For example, consider a case in which we have billions of customer emails and must find the names of the customers who used the word "cancel" in their emails. The request has to be processed quickly, and HBase was designed for exactly such cases.
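HBase's sparse, column-oriented table can be sketched as a map keyed by row key and column, so absent cells simply take no space. This is a conceptual sketch with made-up data, not the HBase client API:

```python
# Sketch of HBase's sparse-table data model: a map from
# (row key, column) to value. Missing cells cost nothing, which is
# what makes tables with billions of rows and columns feasible.
class SparseTable:
    def __init__(self):
        self.cells = {}                        # (row, column) -> value

    def put(self, row, column, value):
        self.cells[(row, column)] = value

    def get(self, row, column, default=None):
        return self.cells.get((row, column), default)

    def scan_column(self, column):
        """Return the rows that have a value for the given column."""
        return {r: v for (r, c), v in self.cells.items() if c == column}

emails = SparseTable()
emails.put("cust001", "email:subject", "please cancel my order")
emails.put("cust002", "email:subject", "invoice question")
emails.put("cust002", "email:attachment", "invoice.pdf")  # cust001 has none

cancellers = {r for r, v in emails.scan_column("email:subject").items()
              if "cancel" in v}
```

In real HBase the rows are additionally sorted by row key and split into regions, so a lookup touches only the RegionServer owning that key range.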
The components of HBase are:
a. HBase Master: The HBase Master is not part of the actual data storage. It is responsible for negotiating load balancing across all the RegionServers, monitors and maintains the Hadoop cluster, and controls failover. The HMaster also handles DDL operations.
b. RegionServer: The RegionServer is the worker node. It handles read, write, delete, and update requests from clients. A RegionServer process runs on every node in the Hadoop cluster, on the HDFS DataNode.
Features of HBase
- Scalable storage
- Supports fault tolerance
- Supports real-time search on sparse data
8. HCatalog
The Hadoop ecosystem provides a table and storage management layer for Hadoop called HCatalog. With HCatalog, users of different data processing tools such as Hive, Pig, and MapReduce can easily read and write data on the grid. It exposes the metadata stored in Hive's metastore to all other applications. It allows users to store data in any format and structure, so users don't have to worry about how the data is stored.
HCatalog supports RCFile, CSV, JSON, sequence file, and ORC file formats by default.
Features of HCatalog
- It enables notifications of data availability.
- HCatalog frees the user from the overhead of data storage and format with table abstraction.
- HCatalog can provide visibility for data cleaning and archiving tools.
9. Apache Thrift
Apache Thrift is a software framework from the Apache Software Foundation for scalable cross-language service development. It was developed at Facebook. Apache Thrift combines a software stack with a code-generation engine for building cross-language services, and provides an interface definition language for Remote Procedure Call (RPC) communication. Apache Thrift is used in the Hadoop ecosystem for performance reasons, as Hadoop makes a lot of RPC calls.
10. Apache Flume
Apache Flume is an open-source tool for ingesting data from multiple sources into HDFS, HBase, or another central repository. It is a distributed system designed to move data from various applications to the Hadoop Distributed File System. Using Flume, we can collect, aggregate, and move streaming data (for example, log files and events) from web servers to centralized stores. Apache Flume acts as a courier between various data sources and HDFS, transferring data generated by sources such as social media platforms and e-commerce sites into Hadoop storage.
Features of Apache Flume
- Apache Flume is a scalable, extensible, fault-tolerant, and distributed service.
- Apache Flume has a simple and flexible architecture.
- Apache Flume is horizontally scalable.
- Apache Flume has the flexibility of collecting data in batch or real-time mode.
11. Apache Sqoop
Apache Sqoop is another data ingestion tool. It is designed for transferring data between relational databases and Hadoop: importing data from and exporting data to relational databases. Most enterprises store data in an RDBMS, so Sqoop is used to import that data into Hadoop's distributed storage for analysis.
Database administrators and developers can use the command-line interface for importing and exporting data. Apache Sqoop converts these commands into MapReduce jobs and sends them to the Hadoop Distributed File System via YARN. The Sqoop import tool imports individual tables from relational databases into HDFS, and the Sqoop export tool exports sets of files from HDFS back to an RDBMS.
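The shape of a Sqoop import, relational rows going out and delimited "part" files coming in, one per parallel task, can be sketched with sqlite3 standing in for the RDBMS and strings standing in for HDFS files. The table, split strategy, and file naming here are simplified illustrations, not Sqoop's exact behavior:

```python
import sqlite3

# Conceptual sketch of `sqoop import`: read a relational table and write
# its rows as delimited text "part" files, split across parallel mapper
# tasks. sqlite3 stands in for the RDBMS; strings stand in for HDFS files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")])

def import_table(conn, table, num_mappers):
    """Split the table's rows across 'mappers'; render each split as a file."""
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY id").fetchall()
    parts = {}
    for m in range(num_mappers):
        split = rows[m::num_mappers]          # round-robin split across tasks
        parts[f"part-m-{m:05d}"] = "\n".join(
            ",".join(map(str, r)) for r in split)
    return parts

parts = import_table(conn, "customers", num_mappers=2)
```

Real Sqoop splits the table by ranges of a key column rather than round-robin, but the principle is the same: each mapper imports its own slice in parallel.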
Features of Sqoop
- It supports compression.
- Sqoop is fault-tolerant.
- Like Apache Flume, Sqoop can perform concurrent operations.
- Sqoop supports Kerberos authentication.
12. Apache Oozie
Oozie is a scheduler system that runs and manages Hadoop jobs in a distributed environment. Oozie allows combining multiple complex jobs and running them in sequence to achieve bigger tasks. It is a Java web application, open source and available under the Apache License 2.0. Apache Oozie is tightly integrated with the Hadoop stack and supports all Hadoop jobs, such as Pig, Sqoop, and Hive, as well as system-specific jobs such as shell and Java.
Oozie triggers workflow actions, which in turn use the Hadoop execution engine for actually executing the task.
There are two kinds of Oozie jobs:
a. Oozie workflow: An Oozie workflow is a sequential set of actions to be executed. We can think of it as a relay race.
b. Oozie Coordinator: Oozie Coordinator jobs are triggered when data becomes available to them. We can think of it as a stimulus-response system: the Coordinator responds to the availability of data and rests otherwise.
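The two job types can be sketched in plain Python (hypothetical action names, nothing from Oozie's actual XML job definitions): a workflow runs its actions in sequence, while a coordinator only fires the workflow once its input data is available.

```python
# Toy sketch of Oozie's two job types. A workflow is an ordered list of
# actions run in sequence (the "relay race"); a coordinator waits for
# data availability and then triggers the workflow.
def run_workflow(actions, log):
    for name, action in actions:          # sequential, like a relay race
        log.append(f"start:{name}")
        action()
        log.append(f"done:{name}")

def coordinator(data_available, actions, log):
    if not data_available:
        log.append("waiting")             # rest until the data shows up
        return
    run_workflow(actions, log)

log = []
actions = [("sqoop-import", lambda: None), ("pig-transform", lambda: None)]
coordinator(data_available=False, actions=actions, log=log)  # nothing runs yet
coordinator(data_available=True, actions=actions, log=log)   # workflow fires
```

In real Oozie the actions would be Hadoop jobs (Pig, Hive, Sqoop, shell) described in a workflow definition, and the coordinator's trigger would be a dataset appearing in HDFS or a time schedule.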
Features of Oozie
- Oozie can leverage existing Hadoop systems for fail-over, load balancing, etc.
- It detects task completion via callback and polling.
- It is extensible, scalable, and reliable.
13. Apache Avro
Avro is an open-source project that provides data exchange and data serialization services to Apache Hadoop. These services can be used independently or together. Avro enables the exchange of big data between programs written in any language, and with its serialization service, programs efficiently serialize data into files or messages.
It stores data definitions as well as data together in one file or message. The data definition stored by Avro is in JSON format. This makes it easy to read and interpret. The data stored by Avro is in a binary format that makes it compact and efficient.
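The idea of storing the readable JSON data definition together with compact binary records can be sketched with Python's json and struct modules. This is a deliberately simplified stand-in, not the actual Avro container file format:

```python
import json
import struct

# Sketch of Avro's core idea: one container holds a JSON schema (easy to
# read and interpret) followed by compact binary records (efficient to
# store). Simplified stand-in, NOT the real Avro wire format.
schema = {"type": "record", "name": "User",
          "fields": [{"name": "id", "type": "int"},
                     {"name": "age", "type": "int"}]}

def write_container(schema, records):
    """Serialize: JSON schema header, then fixed-width binary records."""
    header = json.dumps(schema).encode()
    body = b"".join(struct.pack("<ii", r["id"], r["age"]) for r in records)
    return struct.pack("<i", len(header)) + header + body

def read_container(blob):
    """Deserialize: read the schema, then decode the binary records."""
    (hlen,) = struct.unpack_from("<i", blob, 0)
    schema = json.loads(blob[4:4 + hlen])
    body = blob[4 + hlen:]
    records = []
    for off in range(0, len(body), 8):
        uid, age = struct.unpack_from("<ii", body, off)
        records.append({"id": uid, "age": age})
    return schema, records

blob = write_container(schema, [{"id": 1, "age": 30}, {"id": 2, "age": 41}])
schema_back, records_back = read_container(blob)
```

Because the schema travels with the data, any reader in any language can decode the records without out-of-band agreement, which is what makes Avro suit cross-program data exchange.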
Features of Avro
- Rich data structures.
- Remote procedure call.
- Compact, fast, binary data format.
- A container file, to store persistent data.
14. Apache Ambari
Apache Ambari is an open-source project that aims at making management of Hadoop simpler by developing software for managing, monitoring, and provisioning Hadoop clusters. It is an administration tool that is deployed on the top of Hadoop clusters. Ambari keeps track of the running applications and their status. It provides an easy-to-use Hadoop cluster management web User Interface backed by its RESTful APIs.
It allows a wide range of tools such as Hive, MapReduce, Pig, etc. to be installed on the Hadoop cluster and manages and monitors their performance.
Features of Apache Ambari
- It is flexible.
- Its adaptive technology fits well in enterprise environments.
- Provides authentication, authorization, and auditing through Kerberos.
- User-friendly configuration.
15. Apache Drill
Apache Drill is another important Hadoop ecosystem component. Its main purpose is large-scale processing of structured as well as semi-structured data. Apache Drill is a low-latency distributed query engine that can scale to several thousand nodes and query petabytes of data. Apache Drill has a schema-free model.
Features of Apache Drill
- It has a specialized memory management system for eliminating garbage collection and optimizing memory usage.
- It allows developers to reuse existing Hive deployments.
- Apache Drill provides an extensible and flexible architecture at all layers including query optimization, query layer, and client API.
- Apache Drill provides a hierarchical columnar data model for representing highly dynamic, complex data.
- It allows for efficient processing.
16. Apache Zookeeper
Apache Zookeeper is a Hadoop ecosystem component for managing configuration information and providing distributed synchronization, naming, and group services. Groups of nodes use Zookeeper to coordinate among themselves and maintain shared data through robust synchronization techniques. ZooKeeper is itself a distributed application that provides services for writing distributed applications.
Before Zookeeper, maintaining coordination between the various services in the Hadoop ecosystem was difficult and time-consuming. Zookeeper makes coordination easier and saves a lot of time through synchronization, grouping and naming, and configuration maintenance.
Features of Zookeeper
- Zookeeper is fast, especially with read-dominated workloads.
- It maintains a record of all transactions.
- It offers atomicity: a transaction either completes or fails; transactions are never partially applied.
17. Solr & Lucene
Apache Solr and Apache Lucene are two Hadoop ecosystem services used for searching and indexing. Lucene is a Java-based library that also helps with spell checking. If Apache Lucene is the engine, Apache Solr is the car built around that engine: Solr is a complete application built around Apache Lucene, and it uses the Lucene Java library for searching and indexing.
18. Apache Mahout
It is an open-source, top-level Apache project used for building scalable machine learning algorithms. Apache Mahout implements popular machine learning techniques such as clustering, classification, collaborative filtering, and recommendation. Apache Mahout performs:
a. Collaborative filtering: Apache Mahout mines user behaviors, patterns, and characteristics, and on that basis predicts and provides recommendations to users. E-commerce websites are a typical use case.
b. Clustering: Apache Mahout organizes all similar groups of data together.
c. Classification: Classification means categorizing data into several sub-categories. For example, Apache Mahout can be used to categorize articles into blogs, essays, news, research papers, etc.
d. Frequent itemset mining: Here Apache Mahout checks which objects are likely to appear together and makes suggestions if one is missing. For example, if we search for a mobile phone it will also recommend a mobile cover, because phones and covers are generally bought together.
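The idea behind such "frequently bought together" suggestions can be sketched as item co-occurrence counting: tally how often item pairs appear in the same basket, then recommend the strongest co-occurring items. The purchase data below is made up, and this is a conceptual sketch rather than Mahout's API:

```python
from collections import Counter
from itertools import combinations

# Toy item co-occurrence: count how often pairs of items appear in the
# same basket, then recommend the items most often seen with a query
# item. Made-up data; not Apache Mahout's actual API.
baskets = [
    {"mobile", "cover", "charger"},
    {"mobile", "cover"},
    {"mobile", "cover"},
    {"laptop", "mouse"},
    {"mobile", "charger"},
]

cooccur = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooccur[(a, b)] += 1     # count the pair in both directions so
        cooccur[(b, a)] += 1     # lookups work from either item

def recommend(item, k=1):
    """Top-k items most often seen in the same basket as `item`."""
    scores = {b: n for (a, b), n in cooccur.items() if a == item}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [b for b, _ in ranked][:k]
```

Mahout applies the same principle at cluster scale, computing the co-occurrence counts as distributed MapReduce jobs over millions of baskets.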
Features of Apache Mahout
- It works well in a distributed environment.
- It scales effectively in the cloud infrastructure.
- Apache Mahout offers coders a ready-to-use framework for data mining tasks.
- It lets applications analyze huge data sets effectively and quickly.
I hope that after reading this article you clearly understand what the Hadoop ecosystem is and what its different components are. The Hadoop ecosystem comprises many open-source projects for analyzing data in batch as well as real-time mode. Its many tools offer different services, and together these components empower Hadoop's functionality.