Unlike the Java API, where you are provided an iterator over each key group, in Streaming you have to find key group boundaries in your program.
For each line, we pull out the key and value. Then, if we've just finished a group (last_key && last_key != key), we write the key and the maximum temperature for that group, separated by a tab character, before resetting the maximum temperature for the new key. If we haven't just finished a group, we just update the maximum temperature for the current key.
The last line of the program ensures that a line is written for the last key group in the input.
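The logic just described can be sketched as a small Ruby method. This is a standalone sketch, not the script from the book's repository: the hypothetical helper reduce_max takes an array of lines so it can be run directly, whereas the real reduce script reads STDIN line by line.

```ruby
# Sketch of the Streaming reduce logic described above, assuming
# tab-separated "key\tvalue" lines already sorted by key.
def reduce_max(lines)
  out = []
  last_key, max_val = nil, nil
  lines.each do |line|
    key, val = line.chomp.split("\t")
    val = val.to_i
    if last_key && last_key != key
      out << "#{last_key}\t#{max_val}"  # finished a group: emit its maximum
      max_val = val                     # reset the maximum for the new key
    else
      max_val = max_val.nil? ? val : [max_val, val].max
    end
    last_key = key
  end
  out << "#{last_key}\t#{max_val}" if last_key  # flush the final key group
  out
end

# Sorted mapper output, as the Streaming framework would deliver it:
puts reduce_max(["1949\t111", "1949\t78", "1950\t22", "1950\t0"])
```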
We can now simulate the whole MapReduce pipeline with a Unix pipeline (which is equivalent to the Unix pipeline shown in Figure 2-1):
% cat input/ncdc/sample.txt | \
ch02-mr-intro/src/main/ruby/max_temperature_map.rb | \
sort | ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb
1949 111
1950 22
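For reference, the map side of this pipeline can be sketched in a few lines of Ruby as well. The fixed-width offsets below (year in columns 15-18, temperature in 87-91, quality code in column 92) are assumptions based on the NCDC record format used earlier in the chapter, and map_line is a hypothetical helper name:

```ruby
# Sketch of the map logic: extract year and temperature from an NCDC
# fixed-width record, dropping missing readings (+9999) and bad quality
# codes (anything other than 0, 1, 4, 5, or 9).
def map_line(line)
  year, temp, q = line[15, 4], line[87, 5], line[92, 1]
  "#{year}\t#{temp.to_i}" if temp != "+9999" && q =~ /[01459]/
end

# A synthetic record with the year, temperature, and quality code in place:
record = " " * 15 + "1949" + " " * 68 + "+0111" + "1"
puts map_line(record)  # 1949	111
```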
The output is the same as that of the Java program, so the next step is to run it using Hadoop itself.
The hadoop command doesn't support a Streaming option; instead, you specify the
Streaming JAR file along with the jar option. Options to the Streaming program specify
the input and output paths and the map and reduce scripts. This is what it looks like:
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02-mr-intro/src/main/ruby/max_temperature_map.rb \
-reducer ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb
When running on a large dataset on a cluster, we should use the -combiner option to set
the combiner:
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files ch02-mr-intro/src/main/ruby/max_temperature_map.rb,\
ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb \
-input input/ncdc/all \
-output output \
-mapper ch02-mr-intro/src/main/ruby/max_temperature_map.rb \
-combiner ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb \
-reducer ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb
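The reduce script can double as the combiner here because taking a maximum is associative and commutative: the max of per-map partial maxima equals the single global max. A quick numeric sketch of why (the grouping into two "map tasks" is illustrative):

```ruby
# Max is associative and commutative, so combining per-map partial maxima
# (what a combiner emits) gives the same answer as one global max.
readings = [111, 78, 22, 0]
partial_maxima = [readings[0, 2].max, readings[2, 2].max]  # two map tasks
raise unless partial_maxima.max == readings.max
puts partial_maxima.max  # 111
```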