Example C-1. Bash script to process raw NCDC datafiles and store them in HDFS
#!/usr/bin/env bash
# NLineInputFormat gives a single line: key is offset, value is S3 URI
read offset s3file
# Retrieve file from S3 to local disk
echo "reporter:status:Retrieving $s3file " >&2
$HADOOP_HOME/bin/hadoop fs -get $s3file .
# Un-bzip and un-tar the local file
target= ` basename $s3file .tar.bz2 `
mkdir -p $target
echo "reporter:status:Un-tarring $s3file to $target " >&2
tar jxf ` basename $s3file ` -C $target
# Un-gzip each station file and concat into one file
echo "reporter:status:Un-gzipping $target " >&2
for file in $target/*/*
do
gunzip -c $file >> $target.all
echo "reporter:status:Processed $file " >&2
done
# Put gzipped version into HDFS
echo "reporter:status:Gzipping $target and putting in HDFS" >&2
gzip -c $target.all | $HADOOP_HOME/bin/hadoop fs -put - gz/$target.gz
The input is a small text file (ncdc_files.txt) listing all the files to be processed (the files
start out on S3, so they are referenced using S3 URIs that Hadoop understands). Here is a
sample:
s3n://hadoopbook/ncdc/raw/isd-1901.tar.bz2
s3n://hadoopbook/ncdc/raw/isd-1902.tar.bz2
...
s3n://hadoopbook/ncdc/raw/isd-2000.tar.bz2
Because the input format is specified to be NLineInputFormat , each mapper receives
one line of input, which contains the file it has to process. The processing is explained in
the script, but briefly, it unpacks the bzip2 file and then concatenates each station file into
a single file for the whole year. Finally, the file is gzipped and copied into HDFS. Note the
use of hadoop fs -put - to consume from standard input.
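The script runs as the mapper of a map-only Streaming job. As a rough sketch, and assuming the
script is saved as load_ncdc_map.sh (a hypothetical name) and that the Streaming jar sits in the
usual location for your Hadoop version, the job could be launched along these lines:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=0 \
  -D mapreduce.map.speculative=false \
  -D mapreduce.task.timeout=12000000 \
  -input ncdc_files.txt \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -output output \
  -mapper load_ncdc_map.sh \
  -file load_ncdc_map.sh
Disabling reducers, turning off speculative execution, and raising the task timeout are assumptions
that suit long-running, non-idempotent copy tasks; the exact jar path and property names vary
between Hadoop releases. The reporter:status: lines the script writes to standard error follow the
Streaming convention for reporting task status, which also counts as progress and helps keep slow
tasks from being timed out.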