Writing basic MapReduce programs - Hadoop in Action


[Figure: line plot. x-axis: Year, 1960 to 2000; y-axis ticks from 0 to 180000.]

Figure 4.3 Using Hadoop to count patents published each year and Excel to plot the result. This analysis using Hadoop quickly shows the annual patent output to have almost quadrupled in 40 years.

shown in figure 4.3, we can plot the data to visualize it better. You'll see that it has a mostly steady upward trend.

Looking at the list of functions in the Aggregate package in table 4.3, you'll find that most of them are combinations of maximum, minimum, and sum for atomic data types. (For some reason DoubleValueMax and DoubleValueMin aren't supported. They would be trivial modifications of LongValueMax and LongValueMin and would be a welcome addition.) UniqValueCount and ValueHistogram are slightly different, so we'll look at some examples of how to use them.
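
To make the "trivial modification" remark concrete, here is a minimal local sketch in Python of what maximum and sum aggregators compute per key. It is not Hadoop's implementation. The helper names (fold_by_key, long_value_max, long_value_sum, double_value_max) are hypothetical and only mirror the naming style of the Aggregate package. Note that the double variant differs from the long one only in how the value is parsed.

```python
import operator


def fold_by_key(records, combine, parse):
    """Fold (key, raw_value) pairs into one value per key.

    combine merges two parsed values; parse converts the raw string.
    Hypothetical helper for illustration, not part of Hadoop.
    """
    result = {}
    for key, raw in records:
        value = parse(raw)
        if key in result:
            result[key] = combine(result[key], value)
        else:
            result[key] = value
    return result


def long_value_max(records):
    # Largest integer value seen for each key.
    return fold_by_key(records, max, int)


def long_value_sum(records):
    # Sum of the integer values for each key.
    return fold_by_key(records, operator.add, int)


def double_value_max(records):
    # Same fold as long_value_max; only the parser changes.
    return fold_by_key(records, max, float)
```

For example, long_value_max([("1963", "3"), ("1963", "10")]) keeps only the larger value for the key "1963".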

UniqValueCount gives the number of unique values for each key. For example, we may want to know whether more countries are participating in the U.S. patent system over time. We can examine this by looking at the number of countries with patents granted each year. We use a straightforward wrapper of UniqValueCount in listing 4.10 and apply it to the year and country columns of apat63_99.txt (column indexes 1 and 4, respectively).
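
As a rough idea of what such a wrapper could look like, here is a minimal sketch of a Streaming mapper, written for Python 3. It is an assumption about the shape of the script, and the actual listing 4.10 may differ. It relies on the Streaming convention that a mapper feeding the aggregate reducer prefixes each key with the function name and a colon. The function name to_aggregate_records is hypothetical.

```python
#!/usr/bin/env python
import sys


def to_aggregate_records(lines, key_index, value_index):
    """Yield one 'UniqValueCount:<key><TAB><value>' record per CSV line."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        # Skip blank or short lines that lack the requested columns.
        if len(fields) <= max(key_index, value_index):
            continue
        yield "UniqValueCount:" + fields[key_index] + "\t" + fields[value_index]


# Run as a mapper only when both column indexes are given on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    key_index, value_index = int(sys.argv[1]), int(sys.argv[2])
    for record in to_aggregate_records(sys.stdin, key_index, value_index):
        print(record)
```

The sketch splits on plain commas and does not strip quotes or skip a header row, so quoted fields pass through unchanged.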

bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar \
    -input input/apat63_99.txt \
    -output output \
    -file UniqueCount.py \
    -mapper 'UniqueCount.py 1 4' \
    -reducer aggregate

In the output we get one record for each year. Plotting it gives us figure 4.4. We can see that the increase in the number of patents granted from 1960 to 1990 (figure 4.3) didn't come from more countries participating (figure 4.4). The same number of countries filed more patents.
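
To check the expected result on a small sample before running the job, the per-key unique count can be reproduced locally. The sketch below only mimics the semantics described above (the number of distinct values per key) and is not how Hadoop implements the reducer. The function name unique_value_counts is hypothetical, and the input is assumed to be mapper records of the form UniqValueCount:key, a tab, then the value.

```python
from collections import defaultdict


def unique_value_counts(records):
    """Return {key: number of distinct values} for UniqValueCount records."""
    seen = defaultdict(set)
    for record in records:
        tagged_key, _, value = record.partition("\t")
        function, _, key = tagged_key.partition(":")
        # Ignore records tagged for a different aggregate function.
        if function != "UniqValueCount":
            continue
        seen[key].add(value)
    return {key: len(values) for key, values in seen.items()}
```

Feeding it the mapper output for a few hundred lines of apat63_99.txt gives a quick view of how many distinct country codes appear per year in that sample.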
