A Weather Dataset
For our example, we will write a program that mines weather data. Weather sensors collect
data every hour at many locations across the globe and gather a large volume of log data,
which is a good candidate for analysis with MapReduce because we want to process all the
data, and the data is semi-structured and record-oriented.
Data Format
The data we will use is from the National Climatic Data Center, or NCDC. The data is
stored using a line-oriented ASCII format, in which each line is a record. The format
supports a rich set of meteorological elements, many of which are optional or have variable
data lengths. For simplicity, we focus on the basic elements, such as temperature, which are
always present and are of fixed width.
Example 2-1 shows a sample line with some of the salient fields annotated. The line has
been split into multiple lines to show each field; in the real file, fields are packed into one
line with no delimiters.
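Because the basic fields are of fixed width and packed without delimiters, each one can be recovered by position alone. The following is a minimal Python sketch of that idea, not the program developed in the text. It rebuilds the leading part of the sample record from the field values in Example 2-1 and slices it back apart using cumulative widths. The field names are hypothetical labels chosen for this illustration, fields that the example leaves unannotated are skipped, and reading the observation time as HHMM is an assumption.

```python
from datetime import datetime

# (name, sample value) pairs in record order, copied from Example 2-1.
# Names are illustrative labels, not official NCDC identifiers.
# None marks fields the example shows without an annotation.
FIELDS = [
    (None, "0057"),
    ("usaf_station_id", "332130"),
    ("wban_station_id", "99999"),
    ("observation_date", "19500101"),
    ("observation_time", "0300"),
    (None, "4"),
    ("latitude_x1000", "+51317"),
    ("longitude_x1000", "+028783"),
    (None, "FM-12"),
    ("elevation_m", "+0171"),
    (None, "99999"),
    (None, "V020"),
    ("wind_direction_deg", "320"),
    ("wind_direction_quality", "1"),
    (None, "N"),
    (None, "0072"),
    (None, "1"),
    ("sky_ceiling_height_m", "00450"),
    ("sky_ceiling_quality", "1"),
    (None, "C"),
    (None, "N"),
    ("visibility_distance_m", "010000"),
    ("visibility_quality", "1"),
    (None, "N"),
]

# Rebuild the packed, delimiter-free start of the record.
record = "".join(value for _, value in FIELDS)


def parse_record(line):
    """Slice a packed line into named fields using cumulative widths."""
    parsed = {}
    offset = 0
    for name, sample in FIELDS:
        width = len(sample)  # each field's width is taken from its sample value
        if name is not None:
            parsed[name] = line[offset:offset + width]
        offset += width
    return parsed


fields = parse_record(record)

# Convert the scaled integers back to natural units.
latitude = int(fields["latitude_x1000"]) / 1000.0
longitude = int(fields["longitude_x1000"]) / 1000.0
elevation = int(fields["elevation_m"])

# Assumption: the time field is HHMM.
observed_at = datetime.strptime(
    fields["observation_date"] + fields["observation_time"], "%Y%m%d%H%M"
)

print(fields["usaf_station_id"], observed_at.isoformat(), latitude, longitude, elevation)
```

Deriving offsets from the field widths, rather than hard-coding them, keeps the sketch tied to the sample values shown in the example. A real parser would take the offsets from the NCDC format documentation instead.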
Example 2-1. Format of a National Climatic Data Center record
0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
FM-12
+0171 # elevation (meters)
99999
V020
320 # wind direction (degrees)
1 # quality code
N
0072
1
00450 # sky ceiling height (meters)
1 # quality code
C
N
010000 # visibility distance (meters)
1 # quality code
N