Storing and Configuring Data with Hadoop, YARN, and ZooKeeper - Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset

• It offers resilience via data replication.

• It offers automatic failover in the event of a crash.

• It automatically fragments storage over the cluster.

• It brings processing to the data.

• It supports large volumes of files, into the millions.

The third point comes with a caveat: Hadoop V1 has problems with very large scaling. At the time of writing, it is limited to a cluster size of around 4,000 nodes and 40,000 concurrent tasks. Hadoop V2 was developed in part to offer better resource usage and much higher scaling.

Taking Hadoop V2 as an example, Hadoop has four main components. Hadoop Common is a set of utilities that support Hadoop as a whole. Hadoop MapReduce is the parallel processing system used by Hadoop. It involves the steps Map, Shuffle, and Reduce. A large volume of data (the text of this topic, for example) is mapped into smaller elements (the individual words), then an operation (say, a word count) is carried out locally on those small elements. The results are then shuffled into a whole and reduced to a single list of words and their counts. Hadoop YARN handles scheduling and resource management. Finally, Hadoop Distributed File System (HDFS) is the distributed file system that works on a master/slave principle, whereby a name node manages a cluster of slave data nodes.
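The Map, Shuffle, and Reduce steps described above can be illustrated with a small single-process sketch in plain Python. It does not use Hadoop's API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample text are assumptions made for illustration only. On a real cluster, each chunk would be mapped on the node that stores it.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the grouped values to get a single count per word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(chunks):
    """Run the three steps in sequence over a list of text chunks."""
    mapped = []
    for chunk in chunks:  # in Hadoop, chunks would be processed in parallel
        mapped.extend(map_phase(chunk))
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Hypothetical input: three chunks standing in for blocks of a large file.
    chunks = ["big data made easy", "big data tools", "easy tools"]
    print(word_count(chunks))
```

The sketch keeps the three steps as separate functions so that each one matches a stage named in the text; a real MapReduce job would distribute the map and reduce work across many nodes.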

The Hadoop V1 Architecture

In the V1 architecture, a master Job Tracker manages Task Trackers on slave nodes (Figure 2-1). Hadoop's data nodes and Task Trackers co-exist on the same slave nodes.

Figure 2-1. Hadoop V1 architecture
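Because data nodes and Task Trackers share the same slave nodes, a master can try to hand each task to a Task Tracker that already holds the relevant data. The following is a toy model of that idea in plain Python, not the Job Tracker's actual scheduling logic; the function assign_tasks, the node names, and the slot counts are all hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Toy scheduler: prefer a node that stores the block, else any free node.

    block_locations maps a block name to the nodes holding a copy of it.
    free_slots maps a node name to its number of free task slots.
    Returns a dict mapping each block to the chosen node, or None if
    no slot is free anywhere.
    """
    slots = dict(free_slots)  # work on a copy so the input is not modified
    assignments = {}
    for block, nodes in block_locations.items():
        # First choice: a node that holds the data and has a free slot.
        local = [n for n in nodes if slots.get(n, 0) > 0]
        if local:
            chosen = local[0]
        else:
            # Fallback: any node with a free slot (data must travel).
            remote = [n for n, free in slots.items() if free > 0]
            if not remote:
                assignments[block] = None
                continue
            chosen = remote[0]
        slots[chosen] -= 1
        assignments[block] = chosen
    return assignments


if __name__ == "__main__":
    # Hypothetical cluster state for illustration only.
    block_locations = {"b1": ["node1", "node2"], "b2": ["node1"], "b3": ["node3"]}
    free_slots = {"node1": 1, "node2": 1, "node3": 0}
    print(assign_tasks(block_locations, free_slots))
```

In this example, b1 runs locally on node1, b2 falls back to node2 because node1 has no slot left, and b3 stays unassigned because no slots remain.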
