well suited to analysis with Hadoop. Note that Hadoop can perform joins; it's just that
they are not used as much as in the relational world.
MapReduce — and the other processing models in Hadoop — scales linearly with the size
of the data. Data is partitioned, and the functional primitives (like map and reduce) can
work in parallel on separate partitions. This means that if you double the size of the input
data, a job will run twice as slowly. But if you also double the size of the cluster, a job will
run as fast as the original one. This is not generally true of SQL queries.
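To make this partitioned, functional style concrete, here is a minimal sketch of a map function written against Hadoop's Java MapReduce API. The word-count task and the class name WordCountMapper are illustrative choices, not an example taken from this chapter. The framework runs one copy of this class per input partition (an input split), and each copy processes its records independently of all the others, which is what makes the linear scaling just described possible:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in its input split; the framework runs
// one such task per split, in parallel across the cluster.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // key-value pair handed to the framework
      }
    }
  }
}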
Grid Computing
The high-performance computing (HPC) and grid computing communities have been doing large-scale data processing for years, using such application program interfaces (APIs) as the Message Passing Interface (MPI). Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a storage area network (SAN). This works well for predominantly compute-intensive jobs, but it becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which Hadoop really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.
Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local.[8] This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), Hadoop goes to great lengths to conserve it by explicitly modeling network topology. Notice that this arrangement does not preclude high-CPU analyses in Hadoop.
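As a small illustration of data locality in practice, the sketch below (an assumed, illustrative program; the class name BlockHosts is not from this chapter) uses the HDFS FileSystem API to print, for each block of a file, the hosts that store a replica. The MapReduce scheduler consults the same block-location information when it tries to place a map task on, or close to, a node that already holds the data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists, for each block of a file, the hosts holding a replica of that block.
public class BlockHosts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);   // an HDFS path passed on the command line
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset %d: %s%n",
          block.getOffset(), String.join(", ", block.getHosts()));
    }
  }
}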
MPI gives great control to programmers, but it requires that they explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithms for the analyses. Processing in Hadoop operates only at the higher level: the programmer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data flow remains implicit.
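To show how implicit that data flow is, the rest of a word-count job is sketched below: a reducer and a driver to go with the mapper sketched earlier (again, the class names are illustrative). There is no code here for partitioning the input, sorting, or moving map output to the reducers; the programmer declares only the map and reduce functions and their key-value types, and the framework supplies the data flow:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Sums the counts for each word; the framework has already grouped
  // the map output by key before this method is called.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);  // the mapper sketched above
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Everything that happens between the mapper's context.write() calls and the reduce() invocations, including the network transfer of intermediate data, is supplied by the framework rather than written by the programmer.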
Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure — when you don't know whether or not a remote process has failed — and still making progress with the overall computation. Distributed processing frameworks like MapReduce spare the programmer from having to think about failure, since the implementation detects failed tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one another. (This is a slight