Example C-1. Bash script to process raw NCDC datafiles and store them in HDFS
#!/usr/bin/env bash
# NLineInputFormat gives a single line: key is offset, value is S3 URI
read offset s3file
# Retrieve file from S3 to local disk
echo "reporter:status:Retrieving $s3file " >&2
$HADOOP_HOME/bin/hadoop fs -get $s3file .
# Un-bzip and un-tar the local file
target= ` basename $s3file .tar.bz2 `
mkdir -p $target
echo "reporter:status:Un-tarring $s3file to $target " >&2
tar jxf ` basename $s3file ` -C $target
# Un-gzip each station file and concat into one file
echo "reporter:status:Un-gzipping $target " >&2
for file in $target/*/*
do
gunzip -c $file >> $target.all
echo "reporter:status:Processed $file " >&2
done
# Put gzipped version into HDFS
echo "reporter:status:Gzipping $target and putting in HDFS" >&2
gzip -c $target.all | $HADOOP_HOME/bin/hadoop fs -put - gz/$target.gz
The input is a small text file (ncdc_files.txt) listing all the files to be processed (the files
start out on S3, so they are referenced using S3 URIs that Hadoop understands). Here is a
sample:
s3n://hadoopbook/ncdc/raw/isd-1901.tar.bz2
s3n://hadoopbook/ncdc/raw/isd-1902.tar.bz2
...
s3n://hadoopbook/ncdc/raw/isd-2000.tar.bz2
Because the input format is specified to be NLineInputFormat , each mapper receives
one line of input, which contains the file it has to process. The processing is explained in
the script, but briefly, it unpacks the bzip2 file and then concatenates each station file into
a single file for the whole year. Finally, the file is gzipped and copied into HDFS. Note the
use of hadoop fs -put - to consume from standard input.
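The script runs as the mapper of a map-only Streaming job. As a rough sketch, and assuming the
script is saved as load_ncdc_map.sh (a hypothetical name) and that the Streaming jar sits in the
usual location for your Hadoop version, the job could be launched along these lines:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=0 \
  -D mapreduce.map.speculative=false \
  -D mapreduce.task.timeout=12000000 \
  -input ncdc_files.txt \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -output output \
  -mapper load_ncdc_map.sh \
  -file load_ncdc_map.sh
Disabling reducers, turning off speculative execution, and raising the task timeout are assumptions
that suit long-running, non-idempotent copy tasks; the exact jar path and property names vary
between Hadoop releases. The reporter:status: lines the script writes to standard error follow the
Streaming convention for reporting task status, which also counts as progress and helps keep slow
tasks from being timed out.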