Databases Reference
In-Depth Information
120
100
80
60
40
20
0
1960
1965
1970
1975
1980
1985
1990
1995
2000
Year
Figure 4.4 The number of countries with U.S. patents granted in each year. We performed
the computation with a MapReduce job and graphed the result with Excel.
Listing 4.10 UniqueCount.py: a wrapper around the UniqValueCount function
#!/usr/bin/env python
import sys
index1 = int(sys.argv[1])
index2 = int(sys.argv[2])
for line in sys.stdin:
fields = line.split(",")
print "UniqValueCount:" + fields[index1] + "\t" + fields[index2]
The aggregate function ValueHistogram
is the most ambitious function in the Aggre-
gate package. For each key, it outputs the following:
The number of unique values
1
The minimum
count
2
The median
count
3
The maximum
count
4
The average
count
5
The standard deviation
6
In its most general form, it expects the output of the mapper to have the form
ValueHistogram: key\tvalue\tcount
We specify the function ValueHistogram followed by a colon, followed by a tab-
separated key, value, and count triplet. The Aggregate reducer outputs the six statistics
above for each key. Note that for everything except the first statistics (number of
unique values) the counts are summed over each key/value pair. Outputting two
records from your mapper as
ValueHistogram: key_a \t value_a \t10
ValueHistogram: key_a \t value_a \t20
 
 
Search WWH ::




Custom Search