Appendix C. Preparing the NCDC Weather Data
This appendix gives a run-through of the steps taken to prepare the raw weather datafiles so they are in a form that is amenable to analysis using Hadoop. If you want to get a copy of the data to process using Hadoop, you can do so by following the instructions given on the website that accompanies this book. The rest of this appendix explains how the raw weather datafiles were processed.
The raw data is provided as a collection of tar files, compressed with bzip2. Each year's
worth of readings comes in a separate file. Here's a partial directory listing of the files:
1901.tar.bz2
1902.tar.bz2
1903.tar.bz2
...
2000.tar.bz2
Each tar file contains a file for each weather station's readings for the year, compressed with gzip. (The fact that the files in the archive are compressed makes the bzip2 compression on the archive itself redundant.) For example:
% tar jxf 1901.tar.bz2
% ls 1901 | head
029070-99999-1901.gz
029500-99999-1901.gz
029600-99999-1901.gz
029720-99999-1901.gz
029810-99999-1901.gz
227070-99999-1901.gz
Because there are tens of thousands of weather stations, the whole dataset is made up of a large number of relatively small files. It's generally easier and more efficient to process a smaller number of relatively large files in Hadoop (see Small files and CombineFileInputFormat), so in this case, I concatenated the decompressed files for a whole year into a single file, named by the year. I did this using a MapReduce program, to take advantage of its parallel processing capabilities. Let's take a closer look at the program.
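The core of the work for a single year is simply to decompress each station file and append it to one output file. With the archive unpacked as in the earlier listing, a serial sketch of this step using ordinary Unix tools would be (the output filename here is illustrative):

% gunzip -c 1901/*.gz > 1901.all

The MapReduce program performs this same per-year concatenation for all years in parallel.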
The program has only a map function. No reduce function is needed because the map does all the file processing in parallel with no combine stage. The processing can be done with a Unix script, so the Streaming interface to MapReduce is appropriate in this case; see Example C-1.
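The map script follows the pattern sketched below. This is a sketch rather than the exact script used: it assumes the job is configured so that each map task receives a single input line naming one year's tar file (for example, with NLineInputFormat, in which case the line arrives preceded by its byte offset), and the HDFS paths and filenames are illustrative.

#!/usr/bin/env bash
# Read the single input line for this map task. Assuming NLineInputFormat,
# the line arrives as "<byte offset><tab><filename>", so keep only the
# filename. All paths below are illustrative.
read offset tarfile

# Copy the year's archive from HDFS into the task's local working directory.
hadoop fs -get "$tarfile" .

archive=$(basename "$tarfile")
year=$(basename "$archive" .tar.bz2)

# Unpack the per-station gzip files (this creates a directory named after
# the year, as in the listing earlier in this appendix).
tar jxf "$archive"

# Decompress each station file and append it to a single per-year file.
# The status lines tell the Streaming framework that the task is making
# progress, so it isn't killed for appearing to hang.
for station in "$year"/*.gz
do
  gunzip -c "$station" >> "$year.all"
  echo "reporter:status:Processed $station" >&2
done

# Recompress the concatenated year file and store it back in HDFS.
gzip -c "$year.all" | hadoop fs -put - "ncdc/$year.gz"

A job of this kind would be launched with the Streaming JAR, with the script as the mapper, the number of reduce tasks set to zero, and a text file listing the tar filenames as the job's input.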