well suited to analysis with Hadoop. Note that Hadoop can perform joins; it's just that
they are not used as much as in the relational world.
MapReduce — and the other processing models in Hadoop — scales linearly with the size
of the data. Data is partitioned, and the functional primitives (like map and reduce) can
work in parallel on separate partitions. This means that if you double the size of the input
data, a job will run twice as slowly. But if you also double the size of the cluster, a job will
run as fast as the original one. This is not generally true of SQL queries.
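To make this partitioned, functional style concrete, here is a minimal sketch of a map function written against Hadoop's Java MapReduce API. The word-count task and the class name WordCountMapper are illustrative choices, not an example taken from this chapter. The framework runs one copy of this class per input partition (an input split), and each copy processes its records independently of all the others, which is what makes the linear scaling just described possible:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in its input split; the framework runs
// one such task per split, in parallel across the cluster.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // key-value pair handed to the framework
      }
    }
  }
}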
Grid Computing
The high-performance computing (HPC) and grid computing communities have been doing large-scale data processing for years, using such application program interfaces (APIs) as the Message Passing Interface (MPI). Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a storage area network (SAN). This works well for predominantly compute-intensive jobs, but it becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which Hadoop really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.
Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local.[8] This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), Hadoop goes to great lengths to conserve it by explicitly modeling network topology. Notice that this arrangement does not preclude high-CPU analyses in Hadoop.
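As a small illustration of data locality in practice, the sketch below (an assumed, illustrative program; the class name BlockHosts is not from this chapter) uses the HDFS FileSystem API to print, for each block of a file, the hosts that store a replica. The MapReduce scheduler consults the same block-location information when it tries to place a map task on, or close to, a node that already holds the data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists, for each block of a file, the hosts holding a replica of that block.
public class BlockHosts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);   // an HDFS path passed on the command line
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset %d: %s%n",
          block.getOffset(), String.join(", ", block.getHosts()));
    }
  }
}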
MPI gives great control to programmers, but it requires that they explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithms for the analyses. Processing in Hadoop operates only at the higher level: the programmer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data flow remains implicit.
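To show how implicit that data flow is, the rest of a word-count job is sketched below: a reducer and a driver to go with the mapper sketched earlier (again, the class names are illustrative). There is no code here for partitioning the input, sorting, or moving map output to the reducers; the programmer declares only the map and reduce functions and their key-value types, and the framework supplies the data flow:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Sums the counts for each word; the framework has already grouped
  // the map output by key before this method is called.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);  // the mapper sketched above
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Everything that happens between the mapper's context.write() calls and the reduce() invocations, including the network transfer of intermediate data, is supplied by the framework rather than written by the programmer.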
Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure — when you don't know whether or not a remote process has failed — and still making progress with the overall computation. Distributed processing frameworks like MapReduce spare the programmer from having to think about failure, since the implementation detects failed tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one another. (This is a slight