rather than as composed of a key and a value. The truth is that Streaming works on
key/value pairs just like the standard Java MapReduce model. By default, Streaming
uses the tab character to separate the key from the value in a record. When there's no
tab character, the entire record is considered the key and the value is empty text. For
our data sets, which have no tab character, this provides the illusion that we're processing each individual record as a whole unit. Furthermore, even if the records do have
tab characters in them, the Streaming API will only shuffle and sort the records in a
different order. As long as our mapper and reducer work in a record-oriented way, we
can maintain the record-oriented illusion.
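To make that tab convention concrete, the following is a minimal sketch (not taken from the book's listings) of a Streaming mapper that interprets each input line as a key/value pair and echoes it back out using the same convention:

#!/usr/bin/env python
# Minimal sketch: read each input line from STDIN, treat it as a key/value
# pair using the Streaming tab convention, and emit it unchanged.
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    if "\t" in line:
        key, value = line.split("\t", 1)
    else:
        # No tab: the whole line is the key and the value is empty text.
        key, value = line, ""
    print("%s\t%s" % (key, value))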
Working with key/value pairs allows us to take advantage of the key-based shuffling
and sorting to create interesting data analyses. To illustrate key/value pair processing
using Streaming, we can write a program to find the maximum number of claims in a
patent for each country. This would differ from AttributeMax.py in that it tries to find the maximum for each key, rather than a maximum across all records. Let's make this exercise more interesting by computing the average rather than finding the maximum. (As we'll see, Hadoop already includes a package called Aggregate that contains classes to help find the maximum for each key.)
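As a rough sketch of the mapper side of that per-country computation (this is not the book's own listing), a Streaming mapper could emit one line per patent consisting of the country, a tab character, and the claim count. The comma-separated layout and the column positions used below are assumptions about the patent data set; adjust them to the actual file format.

#!/usr/bin/env python
# Hypothetical mapper: emit country<tab>claims for each patent record.
# The column indexes (4 for country, 8 for claims) are assumptions about
# the input layout, not a documented fact about the data set.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    country = fields[4].strip('"')
    claims = fields[8]
    if claims.isdigit():            # skip the header row and missing values
        print("%s\t%s" % (country, claims))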
First, let's examine how key/value pairs work in the Streaming API for each step of
the MapReduce data flow.
1. As we've seen, the mapper under Streaming reads a split through STDIN and extracts each line as a record. Your mapper can choose to interpret each input record as a key/value pair or as a line of text.
2. The Streaming API will interpret each line of your mapper's output as a key/value pair separated by tab. Similar to the standard MapReduce model, we apply the partitioner to the key to find the right reducer to shuffle the record to. All key/value pairs with the same key will end up at the same reducer.
3. At each reducer, key/value pairs are sorted according to the key by the Streaming API. Recall that in the Java model, all key/value pairs of the same key are grouped together into one key and a list of values, and this group is then presented to the reduce() method. Under the Streaming API, your reducer is responsible for performing the grouping. This is not too bad, as the key/value pairs are already sorted by key, so all records of the same key are in one contiguous chunk. Your reducer reads one line at a time from STDIN and keeps track of when a new key appears. (A rough sketch of such a reducer follows this list.)
4. For all practical purposes, the output (STDOUT) of your reducer is written to a file directly. Technically, a no-op step is taken before the file write. In this step the Streaming API breaks each line of the reducer's output at the tab character and feeds the key/value pair to the default TextOutputFormat, which by default re-inserts the tab character before writing the result to a file. Without tab characters in the reducer's output, the entire line is treated as the key with an empty value, so the file ends up showing essentially the same lines your reducer wrote.
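Step 3 is where Streaming differs most from the Java model, so here is a rough sketch of a reducer that performs its own grouping while computing the per-country average. This is an assumed implementation, not the book's listing; it relies on the mapper sketch above emitting a country, a tab, and a claim count, and on the sorted order that Streaming guarantees at the reducer.

#!/usr/bin/env python
# Sketch of a Streaming reducer that does its own grouping.
# Input lines arrive sorted by key in the form country<tab>claims.
import sys

last_key = None
count = 0
total = 0.0

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if last_key is not None and key != last_key:
        # The key changed: emit the average for the previous group.
        print("%s\t%f" % (last_key, total / count))
        count, total = 0, 0.0
    last_key = key
    count += 1
    total += float(value)

if last_key is not None:
    # Flush the final group.
    print("%s\t%f" % (last_key, total / count))

Such a mapper/reducer pair would normally be launched through the Hadoop Streaming jar, passing the scripts with its -mapper, -reducer, and -file options along with -input and -output paths.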
 