rather than as composed of a key and a value. The truth is that Streaming works on
key/value pairs just like the standard Java MapReduce model. By default, Streaming
uses the tab character to separate the key from the value in a record. When there's no
tab character, the entire record is considered the key and the value is empty text. For
our data sets, which have no tab character, this provides the illusion that we're processing each individual record as a whole unit. Furthermore, even if the records do have
tab characters in them, the Streaming API will only shuffle and sort the records in a
different order. As long as our mapper and reducer work in a record-oriented way, we
can maintain the record-oriented illusion.
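To make that tab convention concrete, the following is a minimal sketch (not taken from the book's listings) of a Streaming mapper that interprets each input line as a key/value pair and echoes it back out using the same convention:

#!/usr/bin/env python
# Minimal sketch: read each input line from STDIN, treat it as a key/value
# pair using the Streaming tab convention, and emit it unchanged.
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    if "\t" in line:
        key, value = line.split("\t", 1)
    else:
        # No tab: the whole line is the key and the value is empty text.
        key, value = line, ""
    print("%s\t%s" % (key, value))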
Working with key/value pairs allows us to take advantage of the key-based shuffling
and sorting to create interesting data analyses. To illustrate key/value pair processing
using Streaming, we can write a program to find the maximum number of claims in a
patent for each country. This would differ from AttributeMax.py in that it tries to find the maximum for each key, rather than a maximum across all records. Let's make this exercise more interesting by computing the average rather than finding the maximum. (As we'll see, Hadoop already includes a package called Aggregate that contains classes to help find the maximum for each key.)
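As a rough sketch of the mapper side of that per-country computation (this is not the book's own listing), a Streaming mapper could emit one line per patent consisting of the country, a tab character, and the claim count. The comma-separated layout and the column positions used below are assumptions about the patent data set; adjust them to the actual file format.

#!/usr/bin/env python
# Hypothetical mapper: emit country<tab>claims for each patent record.
# The column indexes (4 for country, 8 for claims) are assumptions about
# the input layout, not a documented fact about the data set.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    country = fields[4].strip('"')
    claims = fields[8]
    if claims.isdigit():            # skip the header row and missing values
        print("%s\t%s" % (country, claims))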
First, let's examine how key/value pairs work in the Streaming API for each step of
the MapReduce data flow.
1. As we've seen, the mapper under Streaming reads a split through STDIN and extracts each line as a record. Your mapper can choose to interpret each input record as a key/value pair or as a line of text.
2. The Streaming API will interpret each line of your mapper's output as a key/value pair separated by tab. Similar to the standard MapReduce model, we apply the partitioner to the key to find the right reducer to shuffle the record to. All key/value pairs with the same key will end up at the same reducer.
3. At each reducer, key/value pairs are sorted according to the key by the Streaming API. Recall that in the Java model, all key/value pairs of the same key are grouped together into one key and a list of values, and this group is then presented to the reduce() method. Under the Streaming API, your reducer is responsible for performing the grouping. This is not too bad, as the key/value pairs are already sorted by key, so all records of the same key are in one contiguous chunk. Your reducer reads one line at a time from STDIN and keeps track of when a new key appears. (A rough sketch of such a reducer follows this list.)
4. For all practical purposes, the output (STDOUT) of your reducer is written to a file directly. Technically, a no-op step is taken before the file write. In this step the Streaming API breaks each line of the reducer's output at the tab character and feeds the key/value pair to the default TextOutputFormat, which by default re-inserts the tab character before writing the result to a file. Without tab characters in the reducer's output, the entire line is treated as the key with an empty value, so the file ends up showing essentially the same lines your reducer wrote.
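Step 3 is where Streaming differs most from the Java model, so here is a rough sketch of a reducer that performs its own grouping while computing the per-country average. This is an assumed implementation, not the book's listing; it relies on the mapper sketch above emitting a country, a tab, and a claim count, and on the sorted order that Streaming guarantees at the reducer.

#!/usr/bin/env python
# Sketch of a Streaming reducer that does its own grouping.
# Input lines arrive sorted by key in the form country<tab>claims.
import sys

last_key = None
count = 0
total = 0.0

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if last_key is not None and key != last_key:
        # The key changed: emit the average for the previous group.
        print("%s\t%f" % (last_key, total / count))
        count, total = 0, 0.0
    last_key = key
    count += 1
    total += float(value)

if last_key is not None:
    # Flush the final group.
    print("%s\t%f" % (last_key, total / count))

Such a mapper/reducer pair would normally be launched through the Hadoop Streaming jar, passing the scripts with its -mapper, -reducer, and -file options along with -input and -output paths.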
 