write your application in the MapReduce model, Hadoop will take care of all that
scalability “plumbing” for you.
1.5.2 Scaling the same program in MapReduce
MapReduce programs are executed in two main phases, called mapping and reducing.
Each phase is defined by a data processing function, and these functions are called
mapper and reducer, respectively. In the mapping phase, MapReduce takes the input
data and feeds each data element to the mapper. In the reducing phase, the reducer
processes all the outputs from the mapper and arrives at a final result.
In simple terms, the mapper is meant to filter and transform the input into something
that the reducer can aggregate over. You may see a striking similarity here with the two
phases we had to develop in scaling up word counting. The similarity is not accidental.
The MapReduce framework was designed after a lot of experience in writing scalable,
distributed programs. This two-phase design pattern was seen in scaling many programs,
and became the basis of the framework.
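To make the filter-and-aggregate split concrete before bringing Hadoop into the picture, here is a minimal single-machine sketch of word counting in plain Java. The stream's flatMap step plays the role of the mapper (turning lines into words) and the grouping collector plays the role of the reducer (aggregating counts per word). The class name and sample lines are only illustrative; this is an analogy, not the Hadoop framework itself.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalWordCount {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello hadoop");

        // "Map" role: transform each line into individual words.
        // "Reduce" role: aggregate the words into per-word counts.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        System.out.println(counts);   // e.g. {hello=2, world=1, hadoop=1}
    }
}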
In scaling our distributed word counting program in the last section, we
also had to write the partitioning and shuffling functions. Partitioning and
shuffling are common design patterns that go along with mapping and reducing.
Unlike mapping and reducing, though, partitioning and shuffling are generic
functionalities that are not too dependent on the particular data processing
application. The MapReduce framework provides a default implementation that
works in most situations.
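For a sense of what that default implementation does, the sketch below shows a partitioner equivalent in spirit to the framework's default hash partitioning: it assigns each intermediate key to a reducer by hashing the key, so all values for a given key end up at the same reducer. The class name is illustrative, and in practice you rarely need to write this yourself.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner, equivalent in spirit to the default hash
// partitioning: every record with the same key goes to the same reducer.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}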
In order for mapping, reducing, partitioning, and shuffling (and a few others we
haven't mentioned) to seamlessly work together, we need to agree on a common
structure for the data being processed. It should be flexible and powerful enough to
handle most of the targeted data processing applications. MapReduce uses lists and
(key/value) pairs as its main data primitives. The keys and values are often integers
or strings but can also be dummy values to be ignored or complex object types. The
map and reduce functions must obey the following constraint on the types of keys
and values.

In the MapReduce framework you write applications by specifying the mapper and
reducer. Let's look at the complete data flow:
             Input               Output
   map       <k1, v1>            list(<k2, v2>)
   reduce    <k2, list(v2)>      list(<k3, v3>)
The input to your application must be structured as a list of (key/value) pairs,
list(<k1, v1>). This input format may seem open-ended but is often quite
simple in practice. The input format for processing multiple files is usually
list(<String filename, String file_content>). The input format for
processing one large file, such as a log file, is list(<Integer line_number,
String log_event>).
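Putting the pieces together, here is a minimal sketch of a mapper and reducer for the word counting example, written against Hadoop's Java API (the org.apache.hadoop.mapreduce classes). In terms of the data flow above, <k1, v1> is a line of text keyed by its position in the file, <k2, v2> is <word, 1>, and <k3, v3> is <word, total count>. The class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: <k1, v1> = <position in file, line of text> -> list(<k2, v2>) = list(<word, 1>)
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);               // emit <word, 1>
            }
        }
    }

    // Reducer: <k2, list(v2)> = <word, list of 1s> -> list(<k3, v3>) = list(<word, total>)
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // emit <word, total>
        }
    }
}

The framework handles reading the input, grouping the mapper's output by key, and feeding each <k2, list(v2)> group to the reducer; a small driver class (not shown) wires the two together into a job.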