is no different than outputting a single record with the sum
ValueHistogram:key_a\tvalue_a\t30
A useful variation is for the mapper to output only the key and value, without the count
and the tab character that goes with it. ValueHistogram automatically assumes a count
of 1 in this case. Listing 4.11 shows a trivial wrapper around ValueHistogram.
Listing 4.11 ValueHistogram.py: wrapper around Aggregate package's ValueHistogram
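#!/usr/bin/env python
import sys

# Column positions of the key and the value, given as command-line arguments
index1 = int(sys.argv[1])
index2 = int(sys.argv[2])

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    # No count field is emitted, so ValueHistogram assumes a count of 1
    print("ValueHistogram:" + fields[index1] + "\t" + fields[index2])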
We run this program to find the distribution of countries with patents granted for
each year.
bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar \
    -input input/apat63_99.txt \
    -output output \
    -file ValueHistogram.py \
    -mapper 'ValueHistogram.py 1 4' \
    -reducer aggregate
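The record format the mapper emits can be checked locally before submitting a job. The following is a minimal sketch, not the Aggregate package's own code: parse_record and tally are hypothetical helper names that stand in for the reducer, and the count is assumed to default to 1 when the third tab-separated field is missing, as described above.

```python
from collections import defaultdict

PREFIX = "ValueHistogram:"


def parse_record(line):
    """Split one mapper output line into (key, value, count).

    The count defaults to 1 when the third tab-separated field is absent.
    """
    body = line.rstrip("\n")[len(PREFIX):]
    parts = body.split("\t")
    key, value = parts[0], parts[1]
    count = int(parts[2]) if len(parts) > 2 else 1
    return key, value, count


def tally(lines):
    """Accumulate per-key histograms as {key: {value: total count}}."""
    hist = defaultdict(lambda: defaultdict(int))
    for line in lines:
        key, value, count = parse_record(line)
        hist[key][value] += count
    return {k: dict(v) for k, v in hist.items()}


# Thirty count-less records versus a single record carrying the sum
singles = ["ValueHistogram:key_a\tvalue_a\n"] * 30
summed = ["ValueHistogram:key_a\tvalue_a\t30\n"]
print(tally(singles) == tally(summed))  # prints True
```

Both inputs produce the same tally, which is why a mapper can safely omit the count when each record represents a single occurrence.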
The output is a tab-separated values (TSV) file with seven columns. The first column,
the year the patent was granted, is the key. The other six columns are the six statistics
that ValueHistogram is set to compute. A partial view of the output follows (we skip the
first two rows for formatting reasons):
1964 58 1 7 38410 816.8103448275862 4997.413601595352
1965 67 1 5 50331 938.1641791044776 6104.779230296307
1966 71 1 5 54634 963.4507042253521 6443.625995189338
1967 68 1 8 51274 965.4705882352941 6177.445623039149
1968 71 1 7 45781 832.4507042253521 5401.229955880634
1969 68 1 8 50394 993.5147058823529 6080.713518728092
1970 72 1 7 47073 894.8472222222222 5527.883233761672
1971 74 1 9 55976 1058.337837837838 6492.837390992137
The first column after the year is the number of unique values. This is exactly the same
as the output of UniqValueCount. The second, third, and fourth columns are the minimum,
median, and maximum, respectively; the last two columns are the average and the standard
deviation. For the patent data set we used, we see that (for every year) the country
receiving the fewest granted patents (other than 0) received 1. Looking specifically at
the output for 1964, the country receiving the most patents received 38410 patents,
whereas about half the countries received fewer than 7 patents. The average number of
patents a country received in 1964 is 816.8, with a standard deviation of 4997.4.
Needless to say, the number of patents granted to each country is highly skewed, given
the discrepancy between the median (7) and the average (816.8).
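To make the six columns concrete, here is a rough sketch of how they could be derived from one key's histogram. The function name histogram_stats and the sample counts are made up for illustration. The sketch also assumes that the median is the middle element of the sorted counts and that the standard deviation is the population form (dividing by n); the Aggregate package's exact conventions may differ.

```python
import math


def histogram_stats(value_counts):
    """Return (unique, min, median, max, mean, std dev) for one key.

    value_counts maps each distinct value to its count and must be non-empty.
    """
    counts = sorted(value_counts.values())
    n = len(counts)
    mean = sum(counts) / float(n)
    # Population variance: average squared deviation from the mean (assumption)
    variance = sum((c - mean) ** 2 for c in counts) / n
    # Median taken as the middle element of the sorted counts (assumption)
    return (n, counts[0], counts[n // 2], counts[-1], mean, math.sqrt(variance))


# Hypothetical per-country patent counts for a single year
sample = {"US": 10, "JP": 4, "DE": 1, "FR": 1}
stats = histogram_stats(sample)
print("\t".join(str(s) for s in stats))
```

For the sample, the unique count is 4, the minimum 1, the median 4, the maximum 10, and the mean 4.0. A single dominant value (such as the top country in the 1964 row) raises the mean and standard deviation while leaving the median small.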
 
 