public static void printUsage(Tool tool, String extraArgsUsage) {
  System.err.printf("Usage: %s [genericOptions] %s\n\n",
      tool.getClass().getSimpleName(), extraArgsUsage);
  GenericOptionsParser.printGenericCommandUsage(System.err);
}
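In a typical driver, a helper like this is called from a Tool implementation's run() method when the wrong number of arguments is supplied. The sketch below is illustrative only: the driver class name, the two-argument check, and the "<input> <output>" usage string are assumptions, and printUsage is presumed to be in scope (defined in, or statically imported into, the same class).

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

// Hypothetical driver skeleton showing where printUsage fits
public class MinimalDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      printUsage(this, "<input> <output>"); // report usage on stderr, then bail out
      return -1;
    }
    // ... configure and submit the job here ...
    return 0;
  }
}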
There are many other default job settings; the ones bolded are those most central to running a job. Let's go through them in turn.
The default input format is TextInputFormat, which produces keys of type LongWritable (the offset of the beginning of the line in the file) and values of type Text (the line of text). This explains where the integers in the final output come from: they are the line offsets.
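As an illustration (a worked example of my own, not one from the text), a file containing the two lines

hello world
goodbye world

would be presented to the mapper as the records (0, hello world) and (12, goodbye world): the first line occupies bytes 0 through 10, the newline sits at byte 11, so the second line begins at offset 12.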
The default mapper is just the Mapper class, which writes the input key and value unchanged to the output:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
Mapper is a generic type, which allows it to work with any key or value types. In this case, the map input and output key is of type LongWritable, and the map input and output value is of type Text.
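To do anything useful, a job replaces this identity behavior by subclassing Mapper and registering the subclass with Job.setMapperClass(). The class below is a hypothetical illustration, not a listing from this chapter: it assumes the default TextInputFormat input types and emits each line together with its length in bytes.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: for each input line, emit (line, byte length of line)
public class LineLengthMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // value.getLength() is the length of the line's UTF-8 encoding in bytes
    context.write(value, new IntWritable(value.getLength()));
  }
}

Because this mapper changes the map output types, the job would also need to declare them, with job.setMapOutputKeyClass(Text.class) and job.setMapOutputValueClass(IntWritable.class).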
The default partitioner is HashPartitioner, which hashes a record's key to determine which partition the record belongs in. Each partition is processed by a reduce task, so the number of partitions is equal to the number of reduce tasks for the job:
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
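The bitwise AND with Integer.MAX_VALUE clears the sign bit of the hash code. This matters because hashCode() may return a negative value, and Java's % operator takes the sign of its left operand, so without the mask getPartition() could return a negative (and therefore invalid) partition number. A small standalone demonstration of the arithmetic (an illustration of my own, not part of the listing above):

public class PartitionDemo {
  public static void main(String[] args) {
    int numReduceTasks = 4;
    int hash = -7; // hashCode() is free to return negative values
    System.out.println(hash % numReduceTasks);                       // -3: not a valid partition
    System.out.println((hash & Integer.MAX_VALUE) % numReduceTasks); // 1: always in 0..3
  }
}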