NOTE For those interested, the NBER website from which we get the patent
data also has a file (list_of_countries.txt) that shows the full country
name for each country code.
Looking at the output of our job and the country codes, we see that
Andorra (AD) patents have an average of 14 claims. Arab Emirates (AE)
patents average 15.4 claims. Antigua and Barbuda (AG) patents average
13.25 claims, and so forth.
4.5.4 Streaming with the Aggregate package
Hadoop includes a library package called Aggregate that simplifies obtaining
aggregate statistics of a data set. This package can simplify the writing of
Java statistics collectors, especially when used with Streaming, which is the
focus of this section. 9
The Aggregate package under Streaming functions as a reducer that computes
aggregate statistics. You only have to provide a mapper that processes records and
sends out a specially formatted output. Each line of the mapper's output looks like
function: key \t value
The output string starts with the name of a value aggregator function (from the set of
predefined functions available in the Aggregate package). A colon and a tab-separated
key/value pair follow. The Aggregate reducer applies the function to the set of values
for each key. For example, if the function is LongValueSum, then the output is the sum
of values for each key. (As the function name implies, each value is treated as a Java
long type.) If the function is LongValueMax, then the output is the maximum value
for each key. You can see the list of aggregator functions supported in the Aggregate
package in table 4.3.
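The mapper side of this contract can be sketched in a few lines of Python. This is a minimal illustration, not the book's own script: the helper names (aggregate_record, map_records), the comma-separated input layout, and the sample lines are all assumptions made for the example. It also assumes the line is written with no spaces around the colon and a literal tab between key and value.

```python
def aggregate_record(function, key, value):
    # Build one mapper output line: aggregator name, colon, key, tab, value.
    return f"{function}:{key}\t{value}"


def map_records(lines, index):
    # For each comma-separated input line, emit a LongValueSum record with
    # value 1 keyed on the chosen field, so the summed values become counts.
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if index < len(fields):
            yield aggregate_record("LongValueSum", fields[index], 1)


# Made-up sample lines (hypothetical id, year, country columns).
sample = ["p1,1963,US\n", "p2,1963,BE\n", "p3,1964,US\n"]
for record in map_records(sample, 1):
    print(record)
```

In an actual Streaming mapper the same loop would presumably read from sys.stdin instead of the in-memory sample, with the column index supplied as a script argument.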
Table 4.3 List of value aggregator functions supported by the Aggregate package

Value aggregator   Description
DoubleValueSum     Sums up a sequence of double values.
LongValueMax       Finds the maximum of a sequence of long values.
LongValueMin       Finds the minimum of a sequence of long values.
LongValueSum       Sums up a sequence of long values.
StringValueMax     Finds the lexicographical maximum of a sequence of string values.
StringValueMin     Finds the lexicographical minimum of a sequence of string values.
UniqValueCount     Finds the number of unique values (for each key).
ValueHistogram     Finds the count, minimum, median, maximum, average, and standard
                   deviation of each value. (See text for further explanation.)
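To make the per-key behavior of these functions concrete, the following Python sketch imitates locally what the Aggregate reducer is described as doing. It is only a simulation for checking intuition, not Hadoop's implementation: the function name simulate_aggregate and the sample records are hypothetical, string comparison follows Python's ordering rather than Java's, and ValueHistogram is left out.

```python
from collections import defaultdict

# Local stand-ins for some of the aggregators in table 4.3 (an assumption
# for illustration; the real ones are Java classes in the Aggregate package).
AGGREGATORS = {
    "DoubleValueSum": lambda values: sum(float(v) for v in values),
    "LongValueMax": lambda values: max(int(v) for v in values),
    "LongValueMin": lambda values: min(int(v) for v in values),
    "LongValueSum": lambda values: sum(int(v) for v in values),
    "StringValueMax": max,
    "StringValueMin": min,
    "UniqValueCount": lambda values: len(set(values)),
}


def simulate_aggregate(records):
    # Group values by (function, key), then apply the named function.
    grouped = defaultdict(list)
    for record in records:
        head, value = record.rstrip("\n").split("\t", 1)
        function, key = head.split(":", 1)
        grouped[(function, key)].append(value)
    return {
        (function, key): AGGREGATORS[function](values)
        for (function, key), values in grouped.items()
    }


# Made-up mapper output lines for demonstration.
records = [
    "LongValueSum:US\t1",
    "LongValueSum:US\t1",
    "LongValueMax:AD\t14",
    "LongValueMax:AD\t9",
    "UniqValueCount:1963\tUS",
    "UniqValueCount:1963\tUS",
    "UniqValueCount:1963\tBE",
]
for (function, key), result in sorted(simulate_aggregate(records).items()):
    print(function, key, result)
```

Running mapper output through a stand-in like this before submitting a job can be a quick way to confirm that the records are formatted as intended.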
9 Using the Aggregate package in Java is explained in http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html.
 