word, and the value is a count of how many times it appears. Hadoop takes the output of your map tasks and sorts it. For each map output key, a hash value is computed to assign it to a reducer in a step called the shuffle. Each reducer then sums the counts for every word in its input stream and produces a sorted list of words in the document. You can think of mappers as programs that extract data from HDFS files into key-value pairs, and reducers as programs that take the output from the mappers and aggregate the results. The tutorials linked in the following section explain this in greater detail.
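The flow just described can be sketched in plain Java, with no Hadoop dependency; the class and method names here are illustrative, and the in-memory lists stand in for the framework's distributed machinery:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal in-memory sketch of MapReduce word count (illustrative only).
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in the input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by word and sum the counts.
    // A TreeMap keeps the words sorted, mimicking the sorted reducer input.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(map("the quick fox and the lazy dog and the cat")));
    }
}
```

In real Hadoop code the same two phases are written as a Mapper and a Reducer class, and the framework handles grouping and sorting between them.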
You'll be pleased to know that much of the hard work—dividing up the input datasets, assigning the mappers and reducers to nodes, shuffling the data from the mappers to the reducers, and writing out the final results to HDFS—is managed by Hadoop itself. Programmers merely have to write the map and reduce functions. Mappers and reducers are usually written in Java (as in the example cited at the conclusion of this section), and writing MapReduce code is nontrivial for novices. To that end, higher-level constructs have been developed to simplify the task. Pig is one example and will be discussed here. Hadoop Streaming is another.
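The hash-based assignment performed during the shuffle can be sketched in a few lines. This mirrors the logic of Hadoop's default HashPartitioner; the method name below is illustrative:

```java
// Sketch of hash-based shuffle assignment: every mapper computes the same
// partition for a given key, so all counts for one word reach one reducer.
public class PartitionSketch {

    static int partitionFor(String key, int numReducers) {
        // Mask off the sign bit so negative hashCodes still yield
        // a partition in the range [0, numReducers).
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // The same key always lands on the same reducer:
        System.out.println(partitionFor("hadoop", 4) == partitionFor("hadoop", 4));
    }
}
```

Because the assignment is deterministic, no coordination between mappers is needed to route all occurrences of a word to a single reducer.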
Tutorial Links
There are a number of excellent tutorials for working with MapReduce. A good place to start is the official Apache documentation, but Yahoo! has also put together a tutorial module. The folks at MapR, a commercial software company that makes a Hadoop distribution, have a great presentation on writing MapReduce.
Example Code
Writing MapReduce can be fairly complicated and is beyond the scope of this topic. A typical application that folks write to get started is a simple word count. The official documentation includes a tutorial for building that application.