records by dropping the year component. The reduce function then takes the mean
of the maximum temperatures for each station-day-month key.
The output from the first stage looks like this for the station we are interested in (the
mean_max_daily_temp.sh script in the examples provides an implementation in Hadoop
Streaming):
029070-99999 19010101 0
029070-99999 19020101 -94
...
The first two fields form the key, and the final column is the maximum temperature from
all the readings for the given station and date. The second stage averages these daily max-
ima over years to yield:
029070-99999 0101 -68
which is interpreted as saying the mean maximum daily temperature on January 1 for sta-
tion 029070-99999 over the century is −6.8°C.
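The Streaming script is the implementation to refer to; purely as an illustration, the second stage might be written in Java along the lines of the sketch below, where the mapper drops the year from each first-stage record and the reducer averages the maxima. The class names, the whitespace-separated field layout, and the rounding are assumptions for illustration, not code from the book's examples.

// A rough Java sketch of the second stage (the examples implement it as the
// mean_max_daily_temp.sh Streaming script). Field layout and rounding are assumed.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MeanMaxDailyTemp {

  // Re-key first-stage records (station, yyyymmdd, max temp) by station and
  // day-month, dropping the year so readings from different years group together.
  static class StationDayMonthMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\\s+"); // assumed field layout
      String dayMonth = fields[1].substring(4);          // yyyymmdd -> mmdd
      int maxTemp = Integer.parseInt(fields[2]);
      context.write(new Text(fields[0] + "\t" + dayMonth), new IntWritable(maxTemp));
    }
  }

  // Average the daily maxima over all years for each station-day-month key,
  // keeping the result in tenths of a degree Celsius.
  static class MeanMaxReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      int count = 0;
      for (IntWritable value : values) {
        sum += value.get();
        count++;
      }
      context.write(key, new IntWritable((int) Math.round((double) sum / count)));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "mean max daily temperature");
    job.setJarByClass(MeanMaxDailyTemp.class);
    job.setMapperClass(StationDayMonthMapper.class);
    job.setReducerClass(MeanMaxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // first-stage output
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // second-stage output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}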
It's possible to do this computation in one MapReduce stage, but it takes more work on
the part of the programmer.
The arguments for having more (but simpler) MapReduce stages are that doing so leads to
more composable and more maintainable mappers and reducers. Some of the case studies
referred to in Part V cover real-world problems that were solved using MapReduce, and in
each case, the data processing task is implemented using two or more MapReduce jobs.
The details in those case studies are invaluable for getting a better idea of how to decompose a
processing problem into a MapReduce workflow.
It's possible to make map and reduce functions even more composable than we have done.
A mapper commonly performs input format parsing, projection (selecting the relevant
fields), and filtering (removing records that are not of interest). In the mappers you have
seen so far, we have implemented all of these functions in a single mapper. However,
there is a case for splitting these into distinct mappers and chaining them into a single
mapper using the ChainMapper library class that comes with Hadoop. Combined with a
ChainReducer, you can run a chain of mappers, followed by a reducer and another
chain of mappers, in a single MapReduce job.
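As a sketch of how such a chained job might be wired up with the newer MapReduce API's ChainMapper and ChainReducer (org.apache.hadoop.mapreduce.lib.chain), the driver below chains a parsing mapper and a filtering mapper in front of a single reducer. The mapper and reducer classes, the record layout, and the 9999 missing-value sentinel are hypothetical stand-ins for illustration, not code from the book's examples.

// A hedged sketch: chaining a parse step and a filter step before a reducer
// in one MapReduce job. Record layout and class names are assumed.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedTempDriver {

  // Parsing step: split raw lines of the (assumed) form station<TAB>temperature.
  static class ParseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
    }
  }

  // Filtering step: drop sentinel readings (9999 marks a missing temperature here).
  static class FilterMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void map(Text key, IntWritable value, Context context)
        throws IOException, InterruptedException {
      if (value.get() != 9999) {
        context.write(key, value);
      }
    }
  }

  // Reducing step: take the maximum temperature per station.
  static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        max = Math.max(max, value.get());
      }
      context.write(key, new IntWritable(max));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "chained mappers");
    job.setJarByClass(ChainedTempDriver.class);

    // Chain the parsing and filtering mappers, then set the reducer, all in one job.
    ChainMapper.addMapper(job, ParseMapper.class,
        LongWritable.class, Text.class, Text.class, IntWritable.class,
        new Configuration(false));
    ChainMapper.addMapper(job, FilterMapper.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class,
        new Configuration(false));
    ChainReducer.setReducer(job, MaxReducer.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class,
        new Configuration(false));
    // Further mappers could follow the reducer via ChainReducer.addMapper().

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The chained job performs the same parse-then-filter-then-reduce pipeline that a single monolithic mapper would, but each step can be tested and reused on its own.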
JobControl
When there is more than one job in a MapReduce workflow, the question arises: how do
you manage the jobs so they are executed in order? There are several approaches, and the