learning, financial analyses, scientific simulations, or research in the field of bioinformatics.
The general characteristics of the Hadoop system are as follows:
• Scalability: Data sets with a volume of several petabytes (PB) can be processed by distributing them across several thousand nodes of a computer cluster.
• Efficiency: Parallel data processing and a distributed file system allow the data to be manipulated quickly.
• Reliability: Multiple copies of the data can be created and managed. If a cluster node fails, the workflow reorganizes itself without user intervention; hence, automatic error correction is possible.
Hadoop has been designed with scalability in mind so that cluster sizes of up to 10,000 nodes can be realized. The largest Hadoop cluster at Yahoo! currently comprises 32,000 cores in 4,000 nodes, where 16 PB of data are stored and processed. It takes about 16 hours to analyze and sort a 1 PB data set on this cluster.
16.4.5.1 MapReduce
Hadoop implements the MapReduce programming model, which is also of great importance in the Google search engine and its applications (see Section 17.3 Hadoop). Even though the model relies on massively parallel processing of data, it follows a functional approach.
In principle, it has two functions:
1. Map function: Reads key/value pairs to generate intermediate results, which are then output in the form of new key/value pairs.
2. Reduce function: Reads all intermediate results, groups them by keys, and generates aggregated output for each key.
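Viewed abstractly, the two functions can be described by the following type signatures (a schematic sketch; the concrete key and value types depend on the application):

    map:    (key1, value1)          ->  list of (key2, value2)
    reduce: (key2, list of value2)  ->  list of aggregated values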
Usually, the procedures generate lists or queues to store the results of the individual steps. As an example, let us look at how the vocabulary in a text collection can be acquired: the Map function extracts the individual words from the text; the Reduce function reads them, counts the number of occurrences, and stores the result in a list. For parallel processing, Hadoop distributes the texts or text fragments to the available nodes of a computer cluster. The Map nodes process the fragments assigned to them and output the individual words. These outputs are available to all nodes via a distributed file system. The Reduce nodes then read the word lists and count the number of words. Since counting can only start after all words have been processed by the Map function, a bottleneck might arise here.
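To make the word-count example concrete, the following sketch shows how the Map and Reduce functions could be written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names WordCountMapper and WordCountReducer are illustrative; the job driver that configures input, output, and job submission is omitted, and the standard text input format is assumed, which delivers each line of text together with its byte offset as the input key.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map function: reads one line of text and emits an intermediate
    // (word, 1) pair for every word it contains.
    class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce function: receives all intermediate values for one word
    // and outputs the total number of occurrences.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

Between the two phases, Hadoop sorts and groups the intermediate pairs by key, so each call to the Reduce function sees all counts that belong to a single word.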