Unlike the Java API, where you are provided an iterator over each key group, in Streaming you have to find key group boundaries in your program.
For each line, we pull out the key and value. Then, if we've just finished a group (last_key && last_key != key), we write the key and the maximum temperature for that group, separated by a tab character, before resetting the maximum temperature for the new key. If we haven't just finished a group, we just update the maximum temperature for the current key.
The last line of the program ensures that a line is written for the last key group in the input.
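The logic just described can be sketched as a small Ruby method. This is a standalone sketch, not the script from the book's repository: the hypothetical helper reduce_max takes an array of lines so it can be run directly, whereas the real reduce script reads STDIN line by line.

```ruby
# Sketch of the Streaming reduce logic described above, assuming
# tab-separated "key\tvalue" lines already sorted by key.
def reduce_max(lines)
  out = []
  last_key, max_val = nil, nil
  lines.each do |line|
    key, val = line.chomp.split("\t")
    val = val.to_i
    if last_key && last_key != key
      out << "#{last_key}\t#{max_val}"  # finished a group: emit its maximum
      max_val = val                     # reset the maximum for the new key
    else
      max_val = max_val.nil? ? val : [max_val, val].max
    end
    last_key = key
  end
  out << "#{last_key}\t#{max_val}" if last_key  # flush the final key group
  out
end

# Sorted mapper output, as the Streaming framework would deliver it:
puts reduce_max(["1949\t111", "1949\t78", "1950\t22", "1950\t0"])
```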
We can now simulate the whole MapReduce pipeline with a Unix pipeline (which is equivalent to the Unix pipeline shown in Figure 2-1):
% cat input/ncdc/sample.txt | \
ch02-mr-intro/src/main/ruby/max_temperature_map.rb | \
sort | ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb
1949 111
1950 22
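For reference, the map side of this pipeline can be sketched in a few lines of Ruby as well. The fixed-width offsets below (year in columns 15-18, temperature in 87-91, quality code in column 92) are assumptions based on the NCDC record format used earlier in the chapter, and map_line is a hypothetical helper name:

```ruby
# Sketch of the map logic: extract year and temperature from an NCDC
# fixed-width record, dropping missing readings (+9999) and bad quality
# codes (anything other than 0, 1, 4, 5, or 9).
def map_line(line)
  year, temp, q = line[15, 4], line[87, 5], line[92, 1]
  "#{year}\t#{temp.to_i}" if temp != "+9999" && q =~ /[01459]/
end

# A synthetic record with the year, temperature, and quality code in place:
record = " " * 15 + "1949" + " " * 68 + "+0111" + "1"
puts map_line(record)  # 1949	111
```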
The output is the same as that of the Java program, so the next step is to run it using Hadoop itself.
The hadoop command doesn't support a Streaming option; instead, you specify the
Streaming JAR file along with the jar option. Options to the Streaming program specify
the input and output paths and the map and reduce scripts. This is what it looks like:
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02-mr-intro/src/main/ruby/max_temperature_map.rb \
-reducer ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb
When running on a large dataset on a cluster, we should use the -combiner option to set
the combiner:
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files ch02-mr-intro/src/main/ruby/max_temperature_map.rb,\
ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb \
-input input/ncdc/all \
-output output \
-mapper ch02-mr-intro/src/main/ruby/max_temperature_map.rb \
-combiner ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb \
-reducer ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb
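The reduce script can double as the combiner here because taking a maximum is associative and commutative: the max of per-map partial maxima equals the single global max. A quick numeric sketch of why (the grouping into two "map tasks" is illustrative):

```ruby
# Max is associative and commutative, so combining per-map partial maxima
# (what a combiner emits) gives the same answer as one global max.
readings = [111, 78, 22, 0]
partial_maxima = [readings[0, 2].max, readings[2, 2].max]  # two map tasks
raise unless partial_maxima.max == readings.max
puts partial_maxima.max  # 111
```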