word, and the value is a count of how many times it appears. Hadoop takes the output of your map tasks and sorts it. For each map output key, a hash value is computed to assign it to a reducer in a step called the shuffle. Each reducer then sums the counts for every word in its input stream and produces a sorted list of words in the document. You can think of mappers as programs that extract data from HDFS files into key-value pairs, and reducers as programs that take the output from the mappers and aggregate the results. The tutorials linked in the following section explain this in greater detail.
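The flow just described can be sketched in plain Java, with no Hadoop dependency; the class and method names here are illustrative, and the in-memory lists stand in for the framework's distributed machinery:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal in-memory sketch of MapReduce word count (illustrative only).
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in the input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by word and sum the counts.
    // A TreeMap keeps the words sorted, mimicking the sorted reducer input.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(map("the quick fox and the lazy dog and the cat")));
    }
}
```

In real Hadoop code the same two phases are written as a Mapper and a Reducer class, and the framework handles grouping and sorting between them.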
You'll be pleased to know that much of the hard work—dividing up the input datasets, assigning the mappers and reducers to nodes, shuffling the data from the mappers to the reducers, and writing out the final results to HDFS—is managed by Hadoop itself. Programmers merely have to write the map and reduce functions. Mappers and reducers are usually written in Java (as in the example cited at the conclusion of this section), and writing MapReduce code is nontrivial for novices. To that end, higher-level constructs have been developed to simplify the task. Pig is one example and will be discussed here. Hadoop Streaming is another.
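The hash-based assignment performed during the shuffle can be sketched in a few lines. This mirrors the logic of Hadoop's default HashPartitioner; the method name below is illustrative:

```java
// Sketch of hash-based shuffle assignment: every mapper computes the same
// partition for a given key, so all counts for one word reach one reducer.
public class PartitionSketch {

    static int partitionFor(String key, int numReducers) {
        // Mask off the sign bit so negative hashCodes still yield
        // a partition in the range [0, numReducers).
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // The same key always lands on the same reducer:
        System.out.println(partitionFor("hadoop", 4) == partitionFor("hadoop", 4));
    }
}
```

Because the assignment is deterministic, no coordination between mappers is needed to route all occurrences of a word to a single reducer.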
Tutorial Links
There are a number of excellent tutorials for working with MapReduce. A good place to start is the official Apache documentation, but Yahoo! has also put together a tutorial module. The folks at MapR, a commercial software company that makes a Hadoop distribution, have a great presentation on writing MapReduce.
Example Code
Writing MapReduce can be fairly complicated and is beyond the scope of this topic. A typical application that folks write to get started is a simple word count. The official documentation includes a tutorial for building that application.