MapReduce Workflows
So far in this chapter, you have seen the mechanics of writing a program using MapReduce.
We haven't yet considered how to turn a data processing problem into the MapReduce
model.
The data processing you have seen so far in this book solves a fairly simple problem:
finding the maximum recorded temperature for given years. When the processing gets
more complex, this complexity is generally manifested by having more MapReduce jobs,
rather than having more complex map and reduce functions. In other words, as a rule of
thumb, think about adding more jobs, rather than adding complexity to jobs.
For more complex problems, it is worth considering a higher-level language than
MapReduce, such as Pig, Hive, Cascading, Crunch, or Spark. One immediate benefit is that it
frees you from having to do the translation into MapReduce jobs, allowing you to
concentrate on the analysis you are performing.
Finally, the book Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris
Dyer (Morgan & Claypool Publishers, 2010) is a great resource for learning more about
MapReduce algorithm design and is highly recommended.
Decomposing a Problem into MapReduce Jobs
Let's look at an example of a more complex problem that we want to translate into a
MapReduce workflow.
Imagine that we want to find the mean maximum recorded temperature for every day of the
year and every weather station. In concrete terms, to calculate the mean maximum daily
temperature recorded by station 029070-99999, say, on January 1, we take the mean of the
maximum daily temperatures for this station for January 1, 1901; January 1, 1902; and so
on, up to January 1, 2000.
How can we compute this using MapReduce? The computation decomposes most naturally
into two stages:
1. Compute the maximum daily temperature for every station-date pair.
The MapReduce program in this case is a variant of the maximum temperature
program, except that the keys in this case are a composite station-date pair, rather
than just the year.
2. Compute the mean of the maximum daily temperatures for every station-day-month
key.
The mapper takes the (station-date, maximum temperature) records output by the
previous job and projects them into (station-day-month, maximum temperature)
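
The two-stage decomposition can be sketched as a small in-memory simulation in Python. This is only an illustration of how the jobs chain together, not Hadoop code: the run_job helper, the function names, the (station, date, temperature) record layout, the "YYYYMMDD" date format, and the sample readings are all assumptions made up for this example. The second job's key is written as month and day ("MMDD"), which identifies the same day of the year as a day-month key.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # stands in for the shuffle-and-sort phase
        output.extend(reducer(key, groups[key]))
    return output


# Job 1: maximum temperature for every (station, date) pair.
def max_mapper(record):
    station, date, temperature = record
    yield (station, date), temperature


def max_reducer(key, temperatures):
    yield key, max(temperatures)


# Job 2: mean of the daily maxima for every (station, month-day) key.
def project_mapper(record):
    (station, date), max_temperature = record
    # Drop the year: "YYYYMMDD" becomes "MMDD" (assumed date format).
    yield (station, date[4:]), max_temperature


def mean_reducer(key, max_temperatures):
    yield key, sum(max_temperatures) / len(max_temperatures)


def mean_max_daily_temperature(readings):
    """Chain the two jobs: the output of job 1 is the input of job 2."""
    daily_max = run_job(readings, max_mapper, max_reducer)
    return run_job(daily_max, project_mapper, mean_reducer)


# Invented sample readings: (station, date, temperature).
readings = [
    ("029070-99999", "19010101", 0),
    ("029070-99999", "19010101", 22),
    ("029070-99999", "19020101", -44),
    ("029070-99999", "19020101", -94),
    ("029070-99999", "19020102", 10),
]

daily_max = run_job(readings, max_mapper, max_reducer)
result = mean_max_daily_temperature(readings)
for key, mean in result:
    print(key, mean)
```

Each job keeps a trivial map function and a trivial reduce function; the extra complexity of the problem shows up only as a second job, in line with the rule of thumb given earlier.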