write your application in the MapReduce model, Hadoop will take care of all that
scalability “plumbing” for you.
1.5.2 Scaling the same program in MapReduce
MapReduce programs are executed in two main phases, called mapping and reducing.
Each phase is defined by a data processing function, and these functions are called
mapper and reducer, respectively. In the mapping phase, MapReduce takes the input
data and feeds each data element to the mapper. In the reducing phase, the reducer
processes all the outputs from the mapper and arrives at a final result.
In simple terms, the mapper is meant to filter and transform the input into something
that the reducer can aggregate over. You may see a striking similarity here with the two
phases we had to develop in scaling up word counting. The similarity is not accidental.
The MapReduce framework was designed after a lot of experience in writing scalable,
distributed programs. This two-phase design pattern was seen in scaling many programs,
and became the basis of the framework.
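To make the filter-and-aggregate split concrete before bringing Hadoop into the picture, here is a minimal single-machine sketch of word counting in plain Java. The stream's flatMap step plays the role of the mapper (turning lines into words) and the grouping collector plays the role of the reducer (aggregating counts per word). The class name and sample lines are only illustrative; this is an analogy, not the Hadoop framework itself.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalWordCount {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello hadoop");

        // "Map" role: transform each line into individual words.
        // "Reduce" role: aggregate the words into per-word counts.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        System.out.println(counts);   // e.g. {hello=2, world=1, hadoop=1}
    }
}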
In scaling our distributed word counting program in the last section, we
also had to write the partitioning and shuffling functions. Partitioning and
shuffling are common design patterns that go along with mapping and reducing.
Unlike mapping and reducing, though, partitioning and shuffling are generic
functionalities that are not too dependent on the particular data processing
application. The MapReduce framework provides a default implementation that
works in most situations.
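For a sense of what that default implementation does, the sketch below shows a partitioner equivalent in spirit to the framework's default hash partitioning: it assigns each intermediate key to a reducer by hashing the key, so all values for a given key end up at the same reducer. The class name is illustrative, and in practice you rarely need to write this yourself.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner, equivalent in spirit to the default hash
// partitioning: every record with the same key goes to the same reducer.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}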
In order for mapping, reducing, partitioning, and shuffling (and a few others we
haven't mentioned) to seamlessly work together, we need to agree on a common
structure for the data being processed. It should be flexible and powerful enough to
handle most of the targeted data processing applications. MapReduce uses lists and
(key/value) pairs as its main data primitives. The keys and values are often integers
or strings but can also be dummy values to be ignored or complex object types. The
map and reduce functions must obey the following constraint on the types of keys
and values.

In the MapReduce framework you write applications by specifying the mapper and
reducer. Let's look at the complete data flow:
             Input               Output
   map       <k1, v1>            list(<k2, v2>)
   reduce    <k2, list(v2)>      list(<k3, v3>)
The input to your application must be structured as a list of (key/value) pairs,
list(<k1, v1>). This input format may seem open-ended but is often quite
simple in practice. The input format for processing multiple files is usually
list(<String filename, String file_content>). The input format for
processing one large file, such as a log file, is list(<Integer line_number,
String log_event>).
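Putting the pieces together, here is a minimal sketch of a mapper and reducer for the word counting example, written against Hadoop's Java API (the org.apache.hadoop.mapreduce classes). In terms of the data flow above, <k1, v1> is a line of text keyed by its position in the file, <k2, v2> is <word, 1>, and <k3, v3> is <word, total count>. The class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: <k1, v1> = <position in file, line of text> -> list(<k2, v2>) = list(<word, 1>)
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);               // emit <word, 1>
            }
        }
    }

    // Reducer: <k2, list(v2)> = <word, list of 1s> -> list(<k3, v3>) = list(<word, total>)
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // emit <word, total>
        }
    }
}

The framework handles reading the input, grouping the mapper's output by key, and feeding each <k2, list(v2)> group to the reducer; a small driver class (not shown) wires the two together into a job.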