The Apache Hadoop Ecosystem
The Apache open source project Hadoop is the traditional and, undoubtedly, most widely adopted Big Data solution
in the industry. Inspired by papers published by Google and developed largely at Yahoo!, Hadoop is one of the most
scalable and reliable distributed-computing frameworks available. It runs on Unix/Linux and leverages commodity hardware.
A large Hadoop deployment might span as many as 20,000 nodes. Maintaining such an infrastructure is difficult, both from a
management point of view and a financial one. Initially, only large IT enterprises such as Yahoo!, Google, and Microsoft
could afford to invest in Big Data solutions of this scale, powering services such as Google Search, Bing Maps, and so forth. Hardware
and storage costs have since fallen dramatically, however, which enables small companies, or even consumers, to consider using
a Big Data solution. Because this topic covers Microsoft HDInsight, which is based on core Hadoop, we will first give
you a quick look at the Hadoop core components and a few of its supporting projects.
The core of Hadoop is its storage system and its distributed computing model. This model includes the following
technologies and features:
HDFS: Hadoop Distributed File System is responsible for storing data on the cluster. Data is
split into blocks and distributed across multiple nodes in the cluster.
MapReduce: A distributed computing model used to process data in the Hadoop cluster. It
consists of two phases, Map and Reduce, with a shuffle-and-sort step occurring between them.
MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system
performs the sort and transfers the map outputs to the reducers as inputs is known as the shuffle. The shuffle is
the heart of MapReduce, and it's where the "magic" happens; it is also the area of the MapReduce logic where most
optimizations are made. By default, Hadoop sorts the intermediate map outputs with Quicksort and then merges the sorted
runs together. The Quicksort implementation checks its recursion depth and gives up when the recursion becomes too deep,
falling back to Heapsort in that case. You can customize the sorting algorithm via the map.sort.class value in the
hadoop-default.xml file. A minimal job illustrating the Map and Reduce phases is sketched below.
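To make the Map and Reduce phases concrete, here is a minimal word-count job written against the classic org.apache.hadoop.mapreduce API. It is a sketch rather than a production job: the class names are illustrative and the input and output paths are hypothetical command-line arguments. The framework sorts the mapper's intermediate (word, 1) pairs by key during the shuffle, so each reducer receives all counts for a given word together.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: the shuffle has already grouped and sorted the
    // intermediate pairs by key, so all counts for a word arrive together.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical HDFS paths; replace with real input and output locations.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}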
The Hadoop cluster, once successfully configured on a system, has the following basic components:
Name Node: Also called the Head Node of the cluster, it primarily holds the metadata
for HDFS. As data is distributed across the nodes for processing, the Name
Node keeps track of which nodes hold each HDFS data block. The Name Node is also responsible for
heartbeat coordination with the Data Nodes, which lets it identify dead nodes, decommissioning
nodes, and nodes in safe mode. In a classic Hadoop cluster, the Name Node is a single point of
failure. (The sketch after this list shows how a client can query this block metadata.)
Data Node: Stores the actual HDFS data blocks. Blocks are replicated across multiple nodes
to provide fault tolerance and high availability.
Job Tracker: Manages MapReduce jobs, and distributes individual tasks.
Task Tracker: Instantiates and monitors individual Map and Reduce tasks.
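As an illustration of the metadata the Name Node maintains, the following sketch uses Hadoop's Java FileSystem API to ask where the blocks of a file live. The HDFS path is hypothetical, and the example assumes the client's Configuration (fs.defaultFS in core-site.xml) points at a running cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at the cluster's Name Node.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical HDFS path used only for illustration.
        Path file = new Path("/data/sample/input.txt");
        FileStatus status = fs.getFileStatus(file);

        // The Name Node answers this query from its block metadata:
        // for each block, it reports the Data Nodes holding a replica.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

Running this against a multi-node cluster typically prints several hosts per block, reflecting the replication described in the Data Node entry above.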
Additionally, there are a number of supporting projects for Hadoop, each with its own purpose: for
example, feeding input data into the Hadoop system, or providing a data-warehousing layer for ad-hoc queries on top of
Hadoop. Here are a few specific examples worth mentioning:
Hive: A supporting project for the main Apache Hadoop project. It is an abstraction on top of
MapReduce that allows users to query the data without developing MapReduce applications.
It provides the user with a SQL-like query language called Hive Query Language (HQL) to
fetch data from the Hive store (see the sketch after this list).
Pig: An alternative abstraction on top of MapReduce that uses a data-flow scripting language called
Pig Latin.
Flume: Provides a mechanism to import data into HDFS as data is generated.
Sqoop: Provides a mechanism to import and export data to and from relational database
tables and HDFS.
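To give a feel for HQL, the following sketch submits a query to Hive through the standard HiveServer2 JDBC driver. The connection URL, credentials, and table name are hypothetical; it assumes a HiveServer2 instance is reachable and the Hive JDBC driver is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (older driver versions require this).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint; adjust host, port, and credentials.
        String url = "jdbc:hive2://hive-server:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement()) {

            // HQL reads like SQL; Hive compiles the query into MapReduce
            // jobs (or another execution engine) behind the scenes.
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) AS cnt " +
                    "FROM sales_events GROUP BY category ORDER BY cnt DESC");

            while (rs.next()) {
                System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}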
 
Search WWH ::




Custom Search