MAPREDUCE SIGNATURES IN THE OLD API
In the old API (see Appendix D), the signatures are very similar and actually name the type parameters K1, V1, and so on, although the constraints on the types are exactly the same in both the old and new APIs:
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
      throws IOException;
}
public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output,
      Reporter reporter) throws IOException;
}
public interface Partitioner<K2, V2> extends JobConfigurable {
  int getPartition(K2 key, V2 value, int numPartitions);
}
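To make the type parameters concrete, here is a minimal sketch of an old-API mapper that simply passes each record through unchanged. The class name IdentityLineMapper is illustrative, not from the text; note that old-API implementations conventionally extend MapReduceBase, which supplies empty configure() and close() methods to satisfy the JobConfigurable and Closeable contracts in the interface above.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Here K1 = LongWritable, V1 = Text (the types produced by TextInputFormat),
// and the map output types K2, V2 are the same, since nothing is transformed.
public class IdentityLineMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  @Override
  public void map(LongWritable key, Text value,
      OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    output.collect(key, value); // emit each input record unchanged
  }
}
```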
So much for the theory. How does this help you configure MapReduce jobs? Table 8-1 summarizes the configuration options for the new API (and Table 8-2 does the same for the old API). It is divided into the properties that determine the types and those that have to be compatible with the configured types.
Input types are set by the input format. So, for instance, a TextInputFormat generates keys of type LongWritable and values of type Text. The other types are set explicitly by calling the methods on the Job (or JobConf in the old API). If not set explicitly, the intermediate types default to the (final) output types, which default to LongWritable and Text. So, if K2 and K3 are the same, you don't need to call setMapOutputKeyClass(), because it falls back to the type set by calling setOutputKeyClass(). Similarly, if V2 and V3 are the same, you only need to use setOutputValueClass().
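The fallback behavior can be sketched with a new-API job configuration. This is a hypothetical fragment, assuming a job whose map output and final output types are both Text keys and IntWritable values (the job name and mapper/reducer classes are omitted for brevity):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TypeConfigSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "type-config-sketch");

    // Final output types (K3, V3):
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Because K2 == K3 and V2 == V3 here, the intermediate types fall back
    // to the settings above, so these two calls would be redundant:
    // job.setMapOutputKeyClass(Text.class);          // K2 defaults to K3
    // job.setMapOutputValueClass(IntWritable.class); // V2 defaults to V3
  }
}
```

If the intermediate types differed (say, a mapper emitting IntWritable values but a reducer producing Text output), the setMapOutput*Class() calls would be required.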
It may seem strange that these methods for setting the intermediate and final output types exist at all. After all, why can't the types be determined from a combination of the mapper and the reducer? The answer has to do with a limitation in Java generics: type erasure means that the type information isn't always present at runtime, so Hadoop has to be given it explicitly. This also means that it's possible to configure a MapReduce job with incompatible types, because the configuration isn't checked at compile time. The settings