The Apache Hadoop Ecosystem
The Apache open source project Hadoop is the traditional and, undoubtedly, most widely adopted Big Data solution
in the industry. Inspired by papers published by Google and developed largely at Yahoo!, Hadoop is one of the most
scalable and reliable distributed-computing frameworks available. It runs on Unix/Linux and leverages commodity hardware.
A large Hadoop deployment might span as many as 20,000 nodes. Maintaining such an infrastructure is difficult, both from a
management point of view and a financial one. Initially, only large IT enterprises such as Yahoo!, Google, and Microsoft
could afford to invest in Big Data solutions of this scale, powering services such as Google Search, Bing Maps, and so forth. Hardware
and storage costs have since fallen dramatically, however, which enables small companies, or even consumers, to consider using
a Big Data solution. Because this topic covers Microsoft HDInsight, which is based on core Hadoop, we will first give
you a quick look at the Hadoop core components and a few of its supporting projects.
The core of Hadoop is its storage system and its distributed computing model. This model includes the following
technologies and features:
HDFS: Hadoop Distributed File System is responsible for storing data on the cluster. Data is
split into blocks and distributed across multiple nodes in the cluster.
MapReduce: A distributed computing model used to process data in the Hadoop cluster. It
consists of two phases, Map and Reduce, with a shuffle-and-sort step occurring between them.
MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system
performs the sort and transfers the map outputs to the reducers as inputs is known as the shuffle. The shuffle is
the heart of MapReduce, and it's where the "magic" happens; it is also the area of the MapReduce logic where most
optimizations are made. By default, Hadoop sorts the intermediate map outputs with Quicksort and then merges the sorted
runs together. The Quicksort implementation checks its recursion depth and gives up when the recursion becomes too deep,
falling back to Heapsort in that case. You can customize the sorting algorithm via the map.sort.class value in the
hadoop-default.xml file. A minimal job illustrating the Map and Reduce phases is sketched below.
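To make the Map and Reduce phases concrete, here is a minimal word-count job written against the classic org.apache.hadoop.mapreduce API. It is a sketch rather than a production job: the class names are illustrative and the input and output paths are hypothetical command-line arguments. The framework sorts the mapper's intermediate (word, 1) pairs by key during the shuffle, so each reducer receives all counts for a given word together.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: the shuffle has already grouped and sorted the
    // intermediate pairs by key, so all counts for a word arrive together.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical HDFS paths; replace with real input and output locations.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}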
The Hadoop cluster, once successfully configured on a system, has the following basic components:
Name Node: Also called the Head Node of the cluster, it primarily holds the metadata
for HDFS. As data is distributed across the nodes for processing, the Name
Node keeps track of which nodes hold each HDFS data block. The Name Node is also responsible for
heartbeat coordination with the Data Nodes, which lets it identify dead nodes, decommissioning
nodes, and nodes in safe mode. In a classic Hadoop cluster, the Name Node is a single point of
failure. (The sketch after this list shows how a client can query this block metadata.)
Data Node: Stores the actual HDFS data blocks. Blocks are replicated across multiple nodes
to provide fault tolerance and high availability.
Job Tracker: Manages MapReduce jobs, and distributes individual tasks.
Task Tracker: Instantiates and monitors individual Map and Reduce tasks.
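As an illustration of the metadata the Name Node maintains, the following sketch uses Hadoop's Java FileSystem API to ask where the blocks of a file live. The HDFS path is hypothetical, and the example assumes the client's Configuration (fs.defaultFS in core-site.xml) points at a running cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at the cluster's Name Node.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical HDFS path used only for illustration.
        Path file = new Path("/data/sample/input.txt");
        FileStatus status = fs.getFileStatus(file);

        // The Name Node answers this query from its block metadata:
        // for each block, it reports the Data Nodes holding a replica.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

Running this against a multi-node cluster typically prints several hosts per block, reflecting the replication described in the Data Node entry above.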
Additionally, there are a number of supporting projects for Hadoop, each with its own purpose: for
example, feeding input data into the Hadoop system, or providing a data-warehousing layer for ad-hoc queries on top of
Hadoop. Here are a few specific examples worth mentioning:
Hive: A supporting project for the main Apache Hadoop project. It is an abstraction on top of
MapReduce that allows users to query the data without developing MapReduce applications.
It provides the user with a SQL-like query language called Hive Query Language (HQL) to
fetch data from the Hive store (see the sketch after this list).
Pig: An alternative abstraction on top of MapReduce that uses a data-flow scripting language called
Pig Latin.
Flume: Provides a mechanism to import data into HDFS as data is generated.
Sqoop: Provides a mechanism to import and export data to and from relational database
tables and HDFS.
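To give a feel for HQL, the following sketch submits a query to Hive through the standard HiveServer2 JDBC driver. The connection URL, credentials, and table name are hypothetical; it assumes a HiveServer2 instance is reachable and the Hive JDBC driver is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (older driver versions require this).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint; adjust host, port, and credentials.
        String url = "jdbc:hive2://hive-server:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement()) {

            // HQL reads like SQL; Hive compiles the query into MapReduce
            // jobs (or another execution engine) behind the scenes.
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) AS cnt " +
                    "FROM sales_events GROUP BY category ORDER BY cnt DESC");

            while (rs.next()) {
                System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}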
 
Search WWH ::




Custom Search