Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to express our
query as a MapReduce job. After some local, small-scale testing, we will be able to run it
on a cluster of machines.
Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. Each phase has key-value pairs as input and output, the types of which may be
chosen by the programmer. The programmer also specifies two functions: the map function
and the reduce function.
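To make the two phases concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: the names run_job, toy_map, and toy_reduce are hypothetical, and the toy records are invented and pre-simplified to "year temperature" text so that only the flow of key-value pairs is on show.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Run a map phase, group the output by key, then run a reduce phase."""
    grouped = defaultdict(list)
    # Map phase: each input pair may produce zero or more intermediate pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    # Reduce phase: each distinct key is seen once, with all of its values.
    # Keys are sorted here only to make the output order deterministic.
    results = []
    for out_key in sorted(grouped):
        results.extend(reduce_fn(out_key, grouped[out_key]))
    return results


def toy_map(offset, line):
    # Input pair: (offset, text line). Output pair: (year, temperature).
    year, temperature = line.split()
    yield int(year), int(temperature)


def toy_reduce(year, temperatures):
    # Input pair: (year, list of temperatures). Output pair: (year, maximum).
    yield year, max(temperatures)


# Invented toy input: (byte offset, simplified record text).
toy_records = [(0, "1950 22"), (8, "1950 -11"), (17, "1949 111")]
result = run_job(toy_records, toy_map, toy_reduce)
```

The programmer supplies only the two functions and decides what the key and value types are at each stage; the grouping between the phases belongs to the framework.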
The input to our map phase is the raw NCDC data. We choose a text input format that gives
us each line in the dataset as a text value. The key is the offset of the beginning of the line
from the beginning of the file, but as we have no need for this, we ignore it.
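The shape of these input pairs can be sketched in a few lines of Python. This only imitates the behaviour described above; it is not Hadoop's implementation, and offset_line_pairs is a hypothetical helper name.

```python
def offset_line_pairs(data: bytes):
    """Yield (byte offset of line start, line text) for each line in data."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line itself,
        # without its trailing line terminator.
        yield offset, raw.rstrip(b"\r\n").decode("ascii")
        # The terminator still counts toward the next line's offset.
        offset += len(raw)


pairs = list(offset_line_pairs(b"first\nsecond\nthird\n"))
```

Each key is the running total of the bytes in the preceding lines, including their newline characters.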
Our map function is simple. We pull out the year and the air temperature, because these are
the only fields we are interested in. In this case, the map function is just a data preparation
phase, setting up the data in such a way that the reduce function can do its work on it:
finding the maximum temperature for each year. The map function is also a good place to drop
bad records: here we filter out temperatures that are missing, suspect, or erroneous.
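A map function of this kind might be sketched as follows. This is a Python illustration under stated assumptions, not the job's actual code: it works on abbreviated records shaped like the sample lines in this section, so it finds the temperature with a pattern rather than at a fixed column. The value 9999 as the marker for a missing reading and the set of acceptable quality codes are also assumptions, as are the names map_temperature, MISSING, and GOOD_QUALITY.

```python
import re

MISSING = 9999               # assumed marker for a missing temperature reading
GOOD_QUALITY = set("01459")  # assumed quality codes that mark a usable reading

# In the abbreviated records, the signed four-digit temperature and its
# one-digit quality code directly follow the characters "N9".
TEMPERATURE = re.compile(r"N9([+-]\d{4})(\d)")


def map_temperature(offset, line):
    """Yield (year, temperature) for a usable record, nothing otherwise."""
    match = TEMPERATURE.search(line)
    if match is None:
        return                      # erroneous record: no temperature field
    year = line[15:19]              # the year follows the 15-character prefix
    temperature = int(match.group(1))
    quality = match.group(2)
    if temperature != MISSING and quality in GOOD_QUALITY:
        yield year, temperature


good = "0043011990999991950051512004...9999999N9+00221+99999999999..."
# Hypothetical records, invented to exercise the filter:
missing = "0043011990999991950051600004...9999999N9+99991+99999999999..."
suspect = "0043011990999991950051606004...9999999N9+00222+99999999999..."

kept = list(map_temperature(106, good))
dropped = list(map_temperature(0, missing)) + list(map_temperature(0, suspect))
```

The offset argument is accepted but never used, which matches the decision to ignore the input key. Because the function yields nothing for a bad record, filtering needs no separate step.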
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 006701199099999 1950 051507004...9999999N9+ 0000 1+99999999999...)
(106, 004301199099999 1950 051512004...9999999N9+ 0022 1+99999999999...)
(212, 004301199099999 1950 051518004...9999999N9- 0011 1+99999999999...)
(318, 004301265099999 1949 032412004...0500001N9+ 0111 1+99999999999...)
(424, 004301265099999 1949 032418004...0500001N9+ 0078 1+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function. The map
function merely extracts the year and the air temperature (set off by spaces in the pairs
above), and emits them as its output (the temperature values have been interpreted as integers):