Analyzing the Data with Unix Tools
What's the highest recorded global temperature for each year in the dataset? We will
answer this first without using Hadoop, as this information will provide a performance
baseline and a useful means to check our results.
The classic tool for processing line-oriented data is awk. Example 2-2 is a small script to
calculate the maximum temperature for each year.
Example 2-2. A program for finding the maximum recorded temperature by year from
NCDC weather records
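#!/usr/bin/env bash
# Print the maximum valid air temperature for each gzipped year file in all/.
for year in all/*
do
  # Print the year (file name without .gz), followed by a tab and no newline.
  echo -ne "$(basename "$year" .gz)\t"
  gunzip -c "$year" | \
    awk '{ temp = substr($0, 88, 5) + 0;   # air temperature, as a number
           q = substr($0, 93, 1);          # quality code
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done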
The script loops through the compressed year files, first printing the year, and then
processing each file using awk. The awk script extracts two fields from the data: the air
temperature and the quality code. The air temperature value is turned into an integer by adding 0.
Next, a test is applied to see whether the temperature is valid (the value 9999 signifies a
missing value in the NCDC dataset) and whether the quality code indicates that the reading
is not suspect or erroneous. If the reading is OK, the value is compared with the maximum
value seen so far, which is updated if a new maximum is found. The END block is executed
after all the lines in the file have been processed, and it prints the maximum value.
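To see the field extraction and validity test in isolation, the same awk logic can be run against a few synthetic records. This is only a sketch: make_record and max_temp are hypothetical helper names, and the records are padded with 87 filler zeros rather than real NCDC header fields. The only assumption taken from the script above is the layout it relies on, a signed five-character temperature in columns 88 to 92 and a quality code in column 93.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build a 93-character synthetic record.
# $1 = signed temperature field (5 characters), $2 = quality code (1 character).
# Columns 1-87 are filler zeros, not real NCDC data.
make_record() {
  printf '%087d%s%s\n' 0 "$1" "$2"
}

# Hypothetical helper: apply the awk logic from the script above to
# records read from standard input.
max_temp() {
  awk '{ temp = substr($0, 88, 5) + 0;
         q = substr($0, 93, 1);
         if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
       END { print max }'
}

{
  make_record "+0317" 1   # valid reading (31.7 degrees C)
  make_record "+9999" 1   # missing value, skipped
  make_record "+0400" 2   # quality code outside [01459], skipped
  make_record "-0050" 0   # valid, but lower than the current maximum
} | max_temp
```

Run as written, this should print 317: the 9999 record and the record with quality code 2 are rejected, and -50 never exceeds the running maximum. Note that max starts out uninitialized, which awk treats as 0 in a numeric comparison, so an input whose valid readings are all negative would print an empty line.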
Here is the beginning of a run:
% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...
The temperature values in the source file are scaled by a factor of 10, so this works out as a
maximum temperature of 31.7°C for 1901 (there were very few readings at the beginning