Appendix C. Preparing the NCDC Weather Data
This appendix gives a run-through of the steps taken to prepare the raw weather datafiles so they are in a form that is amenable to analysis using Hadoop. If you want to get a copy of the data to process using Hadoop, you can do so by following the instructions given on the website that accompanies this book. The rest of this appendix explains how the raw weather datafiles were processed.
The raw data is provided as a collection of tar files, compressed with bzip2. Each year's
worth of readings comes in a separate file. Here's a partial directory listing of the files:
1901.tar.bz2
1902.tar.bz2
1903.tar.bz2
...
2000.tar.bz2
Each tar file contains a file for each weather station's readings for the year, compressed with gzip. (The fact that the files in the archive are compressed makes the bzip2 compression on the archive itself redundant.) For example:
% tar jxf 1901.tar.bz2
% ls 1901 | head
029070-99999-1901.gz
029500-99999-1901.gz
029600-99999-1901.gz
029720-99999-1901.gz
029810-99999-1901.gz
227070-99999-1901.gz
Because there are tens of thousands of weather stations, the whole dataset is made up of a large number of relatively small files. It's generally easier and more efficient to process a smaller number of relatively large files in Hadoop (see Small files and CombineFileInputFormat), so in this case, I concatenated the decompressed files for a whole year into a single file, named by the year. I did this using a MapReduce program, to take advantage of its parallel processing capabilities. Let's take a closer look at the program.
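The core of the work for a single year is simply to decompress each station file and append it to one output file. With the archive unpacked as in the earlier listing, a serial sketch of this step using ordinary Unix tools would be (the output filename here is illustrative):

% gunzip -c 1901/*.gz > 1901.all

The MapReduce program performs this same per-year concatenation for all years in parallel.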
The program has only a map function. No reduce function is needed because the map does all the file processing in parallel with no combine stage. The processing can be done with a Unix script, so the Streaming interface to MapReduce is appropriate in this case; see Example C-1.
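The map script follows the pattern sketched below. This is a sketch rather than the exact script used: it assumes the job is configured so that each map task receives a single input line naming one year's tar file (for example, with NLineInputFormat, in which case the line arrives preceded by its byte offset), and the HDFS paths and filenames are illustrative.

#!/usr/bin/env bash
# Read the single input line for this map task. Assuming NLineInputFormat,
# the line arrives as "<byte offset><tab><filename>", so keep only the
# filename. All paths below are illustrative.
read offset tarfile

# Copy the year's archive from HDFS into the task's local working directory.
hadoop fs -get "$tarfile" .

archive=$(basename "$tarfile")
year=$(basename "$archive" .tar.bz2)

# Unpack the per-station gzip files (this creates a directory named after
# the year, as in the listing earlier in this appendix).
tar jxf "$archive"

# Decompress each station file and append it to a single per-year file.
# The status lines tell the Streaming framework that the task is making
# progress, so it isn't killed for appearing to hang.
for station in "$year"/*.gz
do
  gunzip -c "$station" >> "$year.all"
  echo "reporter:status:Processed $station" >&2
done

# Recompress the concatenated year file and store it back in HDFS.
gzip -c "$year.all" | hadoop fs -put - "ncdc/$year.gz"

A job of this kind would be launched with the Streaming JAR, with the script as the mapper, the number of reduce tasks set to zero, and a text file listing the tar filenames as the job's input.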