MapReduce - Hadoop: The Definitive Guide

Analyzing the Data with Unix Tools

What's the highest recorded global temperature for each year in the dataset? We will answer this first without using Hadoop, as this information will provide a performance baseline and a useful means to check our results.

The classic tool for processing line-oriented data is awk. Example 2-2 is a small script to calculate the maximum temperature for each year.

Example 2-2. A program for finding the maximum recorded temperature by year from NCDC weather records

#!/usr/bin/env bash
for year in all/*
do
  # Print the year (the file name without its .gz suffix), then a tab
  echo -ne "$(basename "$year" .gz)\t"
  gunzip -c "$year" | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done

The script loops through the compressed year files, first printing the year, and then processing each file using awk. The awk script extracts two fields from the data: the air temperature and the quality code. The air temperature value is turned into an integer by adding 0. Next, a test is applied to see whether the temperature is valid (the value 9999 signifies a missing value in the NCDC dataset) and whether the quality code indicates that the reading is not suspect or erroneous. If the reading is OK, the value is compared with the maximum value seen so far, which is updated if a new maximum is found. The END block is executed after all the lines in the file have been processed, and it prints the maximum value.
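The extraction and validity test can be tried in isolation on synthetic input. The following is a minimal sketch, not part of the original example: the helper names make_record and classify are hypothetical, and the records are 93-character stand-ins (87 filler characters, a 5-character signed temperature in columns 88-92, and a quality code in column 93) rather than real NCDC lines, which carry many more fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a synthetic 93-character record.
# $1 = temperature field (e.g. +0317), $2 = quality code (e.g. 1)
make_record() {
  printf '%087d%s%s\n' 0 "$1" "$2"
}

# Hypothetical helper: apply the same field extraction and validity test
# as Example 2-2, but report the decision for each record instead of
# tracking a maximum.
classify() {
  awk '{ temp = substr($0, 88, 5) + 0;   # "+0317" becomes the number 317
         q = substr($0, 93, 1);          # single-character quality code
         if (temp != 9999 && q ~ /[01459]/) print temp, "ok";
         else print temp, "skipped" }'
}

{
  make_record +0317 1   # valid reading
  make_record +9999 1   # missing-value marker
  make_record -0050 2   # quality code outside [01459]
} | classify
```

Under these assumptions, only the first record passes both checks; the second is rejected as a missing value and the third because of its quality code.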

Here is the beginning of a run:

% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...

The temperature values in the source file are scaled by a factor of 10, so this works out as a maximum temperature of 31.7°C for 1901 (there were very few readings at the beginning
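The scaling can be undone with a short awk filter. This is a sketch only, assuming the script's output is one tab-separated year and value per line; the function name to_celsius is hypothetical, and the sample values are the first two lines of the run shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert "year<TAB>tenths of a degree" lines
# into "year<TAB>degrees Celsius" with one decimal place.
to_celsius() {
  awk -F'\t' '{ printf "%s\t%.1f\n", $1, $2 / 10 }'
}

printf '1901\t317\n1902\t244\n' | to_celsius
```

In practice the same filter could be appended to the run itself, for example ./max_temperature.sh | to_celsius.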
