while (values.hasNext()) {
    V2 v = values.next();
    ...
}
The reduce() method is also given an OutputCollector to gather its key/value output, which is of type K3/V3. Somewhere in the reduce() method you'll call
output.collect((K3) k, (V3) v);
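To see the shape of this loop end to end, here is a minimal plain-Java analogue with no Hadoop dependency. The Collector interface is a hypothetical stand-in for Hadoop's OutputCollector, and the summing logic is just one illustrative choice of reduce.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for Hadoop's OutputCollector<K3, V3>;
// the name and shape here are illustrative, not part of the Hadoop API.
interface Collector<K, V> {
    void collect(K key, V value);
}

public class ReduceSketch {
    // Shape of a reduce(): iterate the grouped values for one key,
    // then emit a single (K3, V3) pair through the collector.
    static void reduce(String key, Iterator<Integer> values,
                       Collector<String, Integer> output) {
        int sum = 0;
        while (values.hasNext()) {
            int v = values.next();
            sum += v;
        }
        output.collect(key, sum);
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        reduce("1000067", List.of(1, 1, 1).iterator(),
               (k, v) -> out.add(k + "," + v));
        System.out.println(out.get(0)); // 1000067,3
    }
}
```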
In addition to having consistent K2 and V2 types across Mapper and Reducer, you'll also need to ensure that the key and value types used in Mapper and Reducer are consistent with the input format, output key class, and output value class set in the driver.
The use of KeyValueTextInputFormat means that K1 and V1 must both be of type Text.
The driver must call setOutputKeyClass() and setOutputValueClass() with the classes of K2 and V2, respectively.
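As a configuration sketch, the relevant driver calls might look like the fragment below (using the older org.apache.hadoop.mapred API). The choice of IntWritable for V2 is an assumption for illustration; your job's actual K2/V2 types go here.

```java
// Driver configuration fragment (old org.apache.hadoop.mapred API).
// With KeyValueTextInputFormat, K1 and V1 are both Text; here we assume
// K2 = Text and V2 = IntWritable, so the driver declares exactly those.
JobConf job = new JobConf(conf, MyJob.class);
job.setInputFormat(KeyValueTextInputFormat.class);
job.setOutputKeyClass(Text.class);          // class of K2
job.setOutputValueClass(IntWritable.class); // class of V2
```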
Finally, all the key and value types must be subtypes of Writable, which ensures a serialization interface for Hadoop to send the data around in a distributed cluster. In fact, the key types implement WritableComparable, a subinterface of Writable. The key types need to additionally support the compareTo() method, as keys are used for sorting in various places in the MapReduce framework.
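To illustrate why compareTo() matters, here is a plain-Java sketch of a key type being sorted, the way the shuffle/sort phase orders keys before handing them to reducers. PatentKey is a hypothetical class for illustration; a real Hadoop key would implement WritableComparable rather than plain Comparable.

```java
import java.util.Arrays;

// Illustrative key type: the framework sorts keys with compareTo()
// before grouping them for the reducer. A real Hadoop key would
// implement WritableComparable, which adds serialization on top.
class PatentKey implements Comparable<PatentKey> {
    final int patentNumber;

    PatentKey(int n) { patentNumber = n; }

    @Override
    public int compareTo(PatentKey other) {
        return Integer.compare(patentNumber, other.patentNumber);
    }

    @Override
    public String toString() { return Integer.toString(patentNumber); }
}

public class SortDemo {
    public static void main(String[] args) {
        PatentKey[] keys = {
            new PatentKey(1000067), new PatentKey(10000), new PatentKey(1)
        };
        Arrays.sort(keys); // uses compareTo(), as the sort phase would
        System.out.println(Arrays.toString(keys)); // [1, 10000, 1000067]
    }
}
```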
4.3
Counting things
Much of what the layperson thinks of as statistics is counting, and many basic Hadoop jobs involve counting. We've already seen the word count example in chapter 1. For the patent citation data, we may want the number of citations a patent has received. This too is counting. The desired output would look like this:
1 2
10000 1
100000 1
1000006 1
1000007 1
1000011 1
1000017 1
1000026 1
1000033 2
1000043 1
1000044 2
1000045 1
1000046 2
1000049 1
1000051 1
1000054 1
1000065 1
1000067 3
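The counting itself is just a group-by over the cited patent. As a plain-Java sketch of the logic the MapReduce job performs, assuming input lines of the form "citing,cited" (the sample data below is made up for illustration): the map phase would emit (cited, 1) and the reduce phase would sum the ones; here we collapse both steps into a single in-memory map.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the citation-count logic: given "citing,cited" pairs,
// count how many times each cited patent appears. In MapReduce terms,
// map emits (cited, 1) and reduce sums the 1s per key.
public class CitationCount {
    static Map<String, Integer> count(String[] citations) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by key
        for (String line : citations) {
            String cited = line.split(",")[1];
            counts.merge(cited, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] data = {
            "3858241,956203", "3858241,1324234", "3858242,956203"
        };
        System.out.println(count(data)); // {1324234=1, 956203=2}
    }
}
```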
In each record, a patent number is associated with the number of citations it has received. We can write a MapReduce program for this task. As we said earlier, you hardly ever write a MapReduce program from scratch. You have an existing MapReduce