records by dropping the year component. The reduce function then takes the mean
of the maximum temperatures for each station-day-month key.
The output from the first stage looks like this for the station we are interested in (the
mean_max_daily_temp.sh script in the examples provides an implementation in Hadoop
Streaming):
029070-99999 19010101 0
029070-99999 19020101 -94
...
The first two fields form the key, and the final column is the maximum temperature from
all the readings for the given station and date. The second stage averages these daily max-
ima over years to yield:
029070-99999 0101 -68
which is interpreted as saying the mean maximum daily temperature on January 1 for sta-
tion 029070-99999 over the century is −6.8°C.
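The Streaming script is the implementation to refer to; purely as an illustration, the second stage might be written in Java along the lines of the sketch below, where the mapper drops the year from each first-stage record and the reducer averages the maxima. The class names, the whitespace-separated field layout, and the rounding are assumptions for illustration, not code from the book's examples.

// A rough Java sketch of the second stage (the examples implement it as the
// mean_max_daily_temp.sh Streaming script). Field layout and rounding are assumed.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MeanMaxDailyTemp {

  // Re-key first-stage records (station, yyyymmdd, max temp) by station and
  // day-month, dropping the year so readings from different years group together.
  static class StationDayMonthMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\\s+"); // assumed field layout
      String dayMonth = fields[1].substring(4);          // yyyymmdd -> mmdd
      int maxTemp = Integer.parseInt(fields[2]);
      context.write(new Text(fields[0] + "\t" + dayMonth), new IntWritable(maxTemp));
    }
  }

  // Average the daily maxima over all years for each station-day-month key,
  // keeping the result in tenths of a degree Celsius.
  static class MeanMaxReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      int count = 0;
      for (IntWritable value : values) {
        sum += value.get();
        count++;
      }
      context.write(key, new IntWritable((int) Math.round((double) sum / count)));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "mean max daily temperature");
    job.setJarByClass(MeanMaxDailyTemp.class);
    job.setMapperClass(StationDayMonthMapper.class);
    job.setReducerClass(MeanMaxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // first-stage output
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // second-stage output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}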
It's possible to do this computation in one MapReduce stage, but it takes more work on
the part of the programmer.
The arguments for having more (but simpler) MapReduce stages are that doing so leads to
more composable and more maintainable mappers and reducers. Some of the case studies
referred to in Part V cover real-world problems that were solved using MapReduce, and in
each case, the data processing task is implemented using two or more MapReduce jobs.
The details in those case studies are invaluable for getting a better idea of how to decompose a
processing problem into a MapReduce workflow.
It's possible to make map and reduce functions even more composable than we have done.
A mapper commonly performs input format parsing, projection (selecting the relevant
fields), and filtering (removing records that are not of interest). In the mappers you have
seen so far, we have implemented all of these functions in a single mapper. However,
there is a case for splitting these into distinct mappers and chaining them into a single
mapper using the ChainMapper library class that comes with Hadoop. Combined with a
ChainReducer, you can run a chain of mappers, followed by a reducer and another
chain of mappers, in a single MapReduce job.
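As a sketch of how such a chained job might be wired up with the newer MapReduce API's ChainMapper and ChainReducer (org.apache.hadoop.mapreduce.lib.chain), the driver below chains a parsing mapper and a filtering mapper in front of a single reducer. The mapper and reducer classes, the record layout, and the 9999 missing-value sentinel are hypothetical stand-ins for illustration, not code from the book's examples.

// A hedged sketch: chaining a parse step and a filter step before a reducer
// in one MapReduce job. Record layout and class names are assumed.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedTempDriver {

  // Parsing step: split raw lines of the (assumed) form station<TAB>temperature.
  static class ParseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
    }
  }

  // Filtering step: drop sentinel readings (9999 marks a missing temperature here).
  static class FilterMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void map(Text key, IntWritable value, Context context)
        throws IOException, InterruptedException {
      if (value.get() != 9999) {
        context.write(key, value);
      }
    }
  }

  // Reducing step: take the maximum temperature per station.
  static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        max = Math.max(max, value.get());
      }
      context.write(key, new IntWritable(max));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "chained mappers");
    job.setJarByClass(ChainedTempDriver.class);

    // Chain the parsing and filtering mappers, then set the reducer, all in one job.
    ChainMapper.addMapper(job, ParseMapper.class,
        LongWritable.class, Text.class, Text.class, IntWritable.class,
        new Configuration(false));
    ChainMapper.addMapper(job, FilterMapper.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class,
        new Configuration(false));
    ChainReducer.setReducer(job, MaxReducer.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class,
        new Configuration(false));
    // Further mappers could follow the reducer via ChainReducer.addMapper().

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The chained job performs the same parse-then-filter-then-reduce pipeline that a single monolithic mapper would, but each step can be tested and reused on its own.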
JobControl
When there is more than one job in a MapReduce workflow, the question arises: how do
you manage the jobs so they are executed in order? There are several approaches, and the