Status messages are echoed to standard error with a reporter:status prefix so that
they get interpreted as MapReduce status updates. This tells Hadoop that the script is
making progress and is not hanging.
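For example, the script can emit such a line whenever it starts a lengthy step. Here is a minimal sketch; report_status is a hypothetical helper and the message text is illustrative, not taken from load_ncdc_map.sh:

# Emit a Streaming status update: Hadoop treats stderr lines beginning with
# reporter:status: as task status, which also counts as progress.
report_status() {
  echo "reporter:status:$1" >&2
}

report_status "Retrieving file"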
The script to run the Streaming job is as follows:
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapred.reduce.tasks=0 \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.task.timeout=12000000 \
  -input ncdc_files.txt \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -output output \
  -mapper load_ncdc_map.sh \
  -file load_ncdc_map.sh
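Because the job uses NLineInputFormat, each map task receives a single line of ncdc_files.txt (the line's byte offset as the key and the line itself as the value), so each invocation of load_ncdc_map.sh handles one file. A minimal sketch of how such a mapper can read its record; the variable names and comments are illustrative, not necessarily those of the actual script:

#!/usr/bin/env bash
# NLineInputFormat gives each map task one record: the key is the byte
# offset and the value is one entry from ncdc_files.txt.
read offset filename
# ... retrieve the file named by $filename, unarchive it, and copy the
#     result to HDFS, emitting reporter:status lines along the way ...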
I set the number of reduce tasks to zero, since this is a map-only job. I also turned off speculative execution so that duplicate tasks wouldn't write the same files (although the approach discussed in Task side-effect files would have worked, too). The task timeout was set to a high value so that Hadoop doesn't kill tasks that take a long time without reporting progress (for example, while unarchiving files or copying to HDFS).
Finally, the files were archived on S3 by copying them from HDFS using distcp.
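As a sketch, that copy might look something like the following; the bucket name is a placeholder, and the S3 filesystem scheme (s3n here) depends on the Hadoop version and configuration:

% hadoop distcp output s3n://my-bucket/ncdc/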