Status messages are echoed to standard error with a reporter:status prefix so that
they get interpreted as MapReduce status updates. This tells Hadoop that the script is
making progress and is not hanging.
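For example, the script can emit such a line whenever it starts a lengthy step. Here is a minimal sketch; report_status is a hypothetical helper and the message text is illustrative, not taken from load_ncdc_map.sh:

# Emit a Streaming status update: Hadoop treats stderr lines beginning with
# reporter:status: as task status, which also counts as progress.
report_status() {
  echo "reporter:status:$1" >&2
}

report_status "Retrieving file"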
The script to run the Streaming job is as follows:
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapred.reduce.tasks=0 \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.task.timeout=12000000 \
  -input ncdc_files.txt \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -output output \
  -mapper load_ncdc_map.sh \
  -file load_ncdc_map.sh
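Because the job uses NLineInputFormat, each map task receives a single line of ncdc_files.txt (the line's byte offset as the key and the line itself as the value), so each invocation of load_ncdc_map.sh handles one file. A minimal sketch of how such a mapper can read its record; the variable names and comments are illustrative, not necessarily those of the actual script:

#!/usr/bin/env bash
# NLineInputFormat gives each map task one record: the key is the byte
# offset and the value is one entry from ncdc_files.txt.
read offset filename
# ... retrieve the file named by $filename, unarchive it, and copy the
#     result to HDFS, emitting reporter:status lines along the way ...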
I set the number of reduce tasks to zero, since this is a map-only job. I also turned off speculative execution so that duplicate tasks wouldn't write the same files (although the approach discussed in Task side-effect files would have worked, too). The task timeout was set to a high value so that Hadoop doesn't kill tasks that take a long time without reporting progress (for example, while unarchiving files or copying to HDFS).
Finally, the files were archived on S3 by copying them from HDFS using distcp.
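As a sketch, that copy might look something like the following; the bucket name is a placeholder, and the S3 filesystem scheme (s3n here) depends on the Hadoop version and configuration:

% hadoop distcp output s3n://my-bucket/ncdc/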