MAPREDUCE SIGNATURES IN THE OLD API
In the old API (see Appendix D), the signatures are very similar and actually name the type parameters K1, V1, and so on, although the constraints on the types are exactly the same in both the old and new APIs:
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
      throws IOException;
}
public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output,
      Reporter reporter) throws IOException;
}
public interface Partitioner<K2, V2> extends JobConfigurable {
  int getPartition(K2 key, V2 value, int numPartitions);
}
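To make the type parameters concrete, here is a minimal sketch of an old-API mapper that simply passes each record through unchanged. The class name IdentityLineMapper is illustrative, not from the text; note that old-API implementations conventionally extend MapReduceBase, which supplies empty configure() and close() methods to satisfy the JobConfigurable and Closeable contracts in the interface above.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Here K1 = LongWritable, V1 = Text (the types produced by TextInputFormat),
// and the map output types K2, V2 are the same, since nothing is transformed.
public class IdentityLineMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  @Override
  public void map(LongWritable key, Text value,
      OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    output.collect(key, value); // emit each input record unchanged
  }
}
```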
So much for the theory. How does this help you configure MapReduce jobs? Table 8-1 summarizes the configuration options for the new API (and Table 8-2 does the same for the old API). It is divided into the properties that determine the types and those that have to be compatible with the configured types.
Input types are set by the input format. So, for instance, a TextInputFormat generates keys of type LongWritable and values of type Text. The other types are set explicitly by calling the methods on the Job (or JobConf in the old API). If not set explicitly, the intermediate types default to the (final) output types, which default to LongWritable and Text. So, if K2 and K3 are the same, you don't need to call setMapOutputKeyClass(), because it falls back to the type set by calling setOutputKeyClass(). Similarly, if V2 and V3 are the same, you only need to use setOutputValueClass().
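The fallback behavior can be sketched with a new-API job configuration. This is a hypothetical fragment, assuming a job whose map output and final output types are both Text keys and IntWritable values (the job name and mapper/reducer classes are omitted for brevity):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TypeConfigSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "type-config-sketch");

    // Final output types (K3, V3):
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Because K2 == K3 and V2 == V3 here, the intermediate types fall back
    // to the settings above, so these two calls would be redundant:
    // job.setMapOutputKeyClass(Text.class);          // K2 defaults to K3
    // job.setMapOutputValueClass(IntWritable.class); // V2 defaults to V3
  }
}
```

If the intermediate types differed (say, a mapper emitting IntWritable values but a reducer producing Text output), the setMapOutput*Class() calls would be required.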
It may seem strange that these methods for setting the intermediate and final output types exist at all. After all, why can't the types be determined from a combination of the mapper and the reducer? The answer has to do with a limitation in Java generics: type erasure means that the type information isn't always present at runtime, so Hadoop has to be given it explicitly. This also means that it's possible to configure a MapReduce job with incompatible types, because the configuration isn't checked at compile time. The settings