[Plot of patents published per year. x-axis: Year, 1960 to 2000; y-axis: 0 to 180,000]
Figure 4.3 Using Hadoop to count patents published each year and Excel to plot the
result. This analysis using Hadoop quickly shows the annual patent output to have
almost quadrupled in 40 years.
As shown in figure 4.3, we can plot the data to visualize it better. You'll see that it
has a mostly steady upward trend.
Looking at the list of functions in the Aggregate package in table 4.3, you'll find that
most of them are combinations of maximum, minimum, and sum for atomic data types.
(For some reason DoubleValueMax and DoubleValueMin aren't supported. They
would be trivial modifications of LongValueMax and LongValueMin and would make a
useful addition.) UniqValueCount and ValueHistogram are slightly different, and we'll
look at some examples of how to use them.
UniqValueCount gives the number of unique values for each key. For example, we
may want to know whether more countries are participating in the U.S. patent system
over time. We can examine this by looking at the number of countries with patents
granted each year. We use a straightforward wrapper of UniqValueCount in listing
4.10 and apply it to the year and country columns of apat63_99.txt (column indexes
1 and 4, respectively).
bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar \
    -input input/apat63_99.txt \
    -output output \
    -file UniqueCount.py \
    -mapper 'UniqueCount.py 1 4' \
    -reducer aggregate
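For readers who don't have listing 4.10 at hand, here's a minimal sketch of what such a wrapper mapper could look like. It isn't the book's listing. It assumes the Aggregate package's convention that each mapper output line has the form function:key, a tab, then the value, and that the input is comma-separated. The helper name format_record and the choice to skip short lines are our own assumptions.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a UniqValueCount wrapper mapper (not the book's listing 4.10)."""
import sys


def format_record(line, key_index, value_index):
    """Build one Aggregate-package record from a comma-separated line.

    Returns None when the line has too few columns (an assumption made
    here so that malformed lines are skipped instead of raising).
    """
    fields = line.rstrip("\n").split(",")
    if max(key_index, value_index) >= len(fields):
        return None
    # Assumed record format: "<function>:<key>\t<value>"
    return "UniqValueCount:" + fields[key_index] + "\t" + fields[value_index]


def main(argv, stdin, stdout):
    """Read lines from stdin and write one record per usable line."""
    key_index, value_index = int(argv[1]), int(argv[2])
    for line in stdin:
        record = format_record(line, key_index, value_index)
        if record is not None:
            stdout.write(record + "\n")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.stderr.write("usage: UniqueCount.py <key column> <value column>\n")
    else:
        main(sys.argv, sys.stdin, sys.stdout)
```

With arguments 1 and 4, each input line yields a record keyed by year whose value is the country field exactly as it appears in the file. The aggregate reducer then does the counting.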
In the output we get one record for each year. Plotting it gives us figure 4.4. We can
see that the increasing number of patents granted from 1960 to 1990 (from figure 4.3)
didn't come from more countries (figure 4.4). The same number of countries had
simply filed more patents.
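To make the per-year country count concrete, here's a small local sketch of the computation that UniqValueCount is described as performing: counting the distinct values seen for each key. It runs without Hadoop. The function name unique_value_counts and the sample pairs are made up for illustration and aren't taken from the patent data.

```python
from collections import defaultdict


def unique_value_counts(pairs):
    """Count the distinct values observed for each key."""
    seen = defaultdict(set)
    for key, value in pairs:
        seen[key].add(value)  # a set ignores repeated values for the same key
    return {key: len(values) for key, values in seen.items()}


# Made-up (year, country) pairs, for illustration only.
sample = [("1963", "US"), ("1963", "BE"), ("1963", "US"), ("1964", "US")]

for year, count in sorted(unique_value_counts(sample).items()):
    print(year, count, sep="\t")
```

Here 1963 counts two countries even though "US" appears twice, which is the difference between a unique count and a plain sum.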
 
 