learning, financial analyses, scientific simulations, or research in the field of bioinformatics.
The general characteristics of the Hadoop system are as follows:
• Scalability: Data sets with a volume of several petabytes (PB) can be processed by distributing them across several thousand nodes of a computer cluster.
• Efficiency: Parallel data processing and a distributed file system allow the data to be manipulated quickly.
• Reliability: Multiple copies of the data can be created and managed. If a cluster node fails, the workflow reorganizes itself without user intervention; hence, automatic error correction is possible.
Hadoop has been designed with scalability in mind so that cluster sizes of up to 10,000 nodes can be realized. The largest Hadoop cluster at Yahoo! currently comprises 32,000 cores in 4,000 nodes, where 16 PB of data are stored and processed. It takes about 16 hours to analyze and sort a 1 PB data set on this cluster.
16.4.5.1 MapReduce
Hadoop implements the MapReduce programming model, which is also of great importance in the Google search engine and its applications (see Section 17.3 Hadoop). Even though the model relies on massively parallel processing of data, it follows a functional approach.
In principle, it has two functions:
1. Map function: Reads key/value pairs to generate intermediate results, which are then output in the form of new key/value pairs.
2. Reduce function: Reads all intermediate results, groups them by keys, and generates aggregated output for each key.
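Viewed abstractly, the two functions can be described by the following type signatures (a schematic sketch; the concrete key and value types depend on the application):

    map:    (key1, value1)          ->  list of (key2, value2)
    reduce: (key2, list of value2)  ->  list of aggregated values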
Usually, the procedures generate lists or queues to store the results of the individual steps. As an example, let us look at how the vocabulary in a text collection can be acquired: the Map function extracts the individual words from the text; the Reduce function reads them, counts the number of occurrences, and stores the result in a list. For parallel processing, Hadoop distributes the texts or text fragments to the available nodes of a computer cluster. The Map nodes process the fragments assigned to them and output the individual words. These outputs are available to all nodes via a distributed file system. The Reduce nodes then read the word lists and count the number of words. Since counting can only start after all words have been processed by the Map function, a bottleneck might arise here.
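To make the word-count example concrete, the following sketch shows how the Map and Reduce functions could be written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names WordCountMapper and WordCountReducer are illustrative; the job driver that configures input, output, and job submission is omitted, and the standard text input format is assumed, which delivers each line of text together with its byte offset as the input key.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map function: reads one line of text and emits an intermediate
    // (word, 1) pair for every word it contains.
    class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce function: receives all intermediate values for one word
    // and outputs the total number of occurrences.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

Between the two phases, Hadoop sorts and groups the intermediate pairs by key, so each call to the Reduce function sees all counts that belong to a single word.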