tity functions, which by definition preserve type. Most MapReduce programs, however,
don't use the same key or value types throughout, so you need to configure the job to
declare the types you are using, as described in the previous section.
Records are sorted by the MapReduce system before being presented to the reducer. In
this case, the keys are sorted numerically, which has the effect of interleaving the lines
from the input files into one combined output file.
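The interleaving effect can be imitated locally without Hadoop. The following is a minimal sketch, assuming plain text files with newline line endings; the helper names emit_offset_records and default_job_output are hypothetical, and awk and sort merely stand in for the record reader and the framework's sort.

```shell
#!/bin/sh
# Sketch only: imitates the shape of the default job's output locally.
# emit_offset_records and default_job_output are hypothetical helper names.

TAB=$(printf '\t')

# Print "byte-offset<TAB>line" for every line of one file, mimicking the
# (offset, line) records that the text input format produces.
emit_offset_records() {
  LC_ALL=C awk '{ printf "%d\t%s\n", off, $0; off += length($0) + 1 }' "$1"
}

# Emit records for every input file, then sort them numerically by key.
# Lines from different files interleave because offsets restart at 0
# in each file. The order among equal keys is not guaranteed by
# MapReduce; -s only makes this local sketch deterministic.
default_job_output() {
  for f in "$@"; do
    emit_offset_records "$f"
  done | LC_ALL=C sort -s -n -t "$TAB" -k1,1
}
```

Calling default_job_output with two small files prints the records of both files merged in offset order, which is the interleaving described above.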
The default output format is TextOutputFormat, which writes out records, one per
line, by converting keys and values to strings and separating them with a tab character.
This is why the output is tab-separated: it is a feature of TextOutputFormat.
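Because each record is written as key, tab, value, standard text tools can split the output again. The sketch below assumes output in that form; values_only is a hypothetical helper name, not part of Hadoop.

```shell
#!/bin/sh
# Sketch only: values_only is a hypothetical helper name.
# cut splits on tab by default, so dropping field 1 removes the key and
# keeps the rest of the line, including any tabs inside the value.
values_only() {
  cut -f2- "$@"
}
```

For example, piping the default job's output through values_only would recover the original lines without their offsets.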
The default Streaming job
In Streaming, the default job is similar, but not identical, to the Java equivalent. The basic
form is:
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper /bin/cat
When we specify a non-Java mapper and the default text mode is in effect (-io text),
Streaming does something special. It doesn't pass the key to the mapper process; it just
passes the value. (For other input formats, the same effect can be achieved by setting
stream.map.input.ignoreKey to true.) This is actually very useful because the
key is just the line offset in the file and the value is the line, which is all most applications
are interested in. The overall effect of this job is to perform a sort of the input.
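The sorting effect of this job can be approximated on a local machine with an ordinary pipeline. This is a rough sketch, not what Streaming actually executes: simulate_default_streaming is a hypothetical name, sort stands in for the shuffle, and details of how Streaming re-serializes keys and values are not reproduced.

```shell
#!/bin/sh
# Sketch only: a local stand-in for the default Streaming job with one
# reducer. simulate_default_streaming is a hypothetical helper name.

TAB=$(printf '\t')

# Stage 1 (mapper): cat passes every input line through unchanged.
# Stage 2 (shuffle): sort orders records by the text before the first
#   tab, which is the part Streaming treats as the key. LC_ALL=C gives
#   byte-wise ordering as an approximation of the framework's ordering.
# Stage 3 (reducer): a second cat stands in for the identity reducer.
simulate_default_streaming() {
  cat "$@" | LC_ALL=C sort -s -t "$TAB" -k1,1 | cat
}
```

Running simulate_default_streaming on a small sample file is a quick way to preview the ordering before submitting a real job.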
With more of the defaults spelled out, the command looks like this (notice that Streaming
uses the old MapReduce API classes):
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input input/ncdc/sample.txt \
-output output \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-mapper /bin/cat \
-partitioner org.apache.hadoop.mapred.lib.HashPartitioner \
-numReduceTasks 1 \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-outputformat org.apache.hadoop.mapred.TextOutputFormat \
-io text