MapReduce Types and Formats - Hadoop: The Definitive Guide


The base path specified in the write() method of MultipleOutputs is interpreted relative to the output directory, and because it may contain file path separator characters (/), it's possible to create subdirectories of arbitrary depth. For example, the following modification partitions the data by station and year so that each year's data is contained in a directory named by the station ID (such as 029070-99999/1901/part-r-00000):

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        parser.parse(value);
        // Base path of the form <stationId>/<year>/part, relative to the output directory
        String basePath = String.format("%s/%s/part",
            parser.getStationId(), parser.getYear());
        multipleOutputs.write(NullWritable.get(), value, basePath);
      }
    }
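To make the mapping from base path to file name concrete, here is a minimal sketch in plain Java that needs no Hadoop classes. It only models the naming seen in the example path above: the base path, followed by -r- and a five-digit partition number. The class OutputPathSketch and its two methods are hypothetical names introduced for illustration and are not part of the Hadoop API.

```java
// Illustrative model only: mimics how a base path maps to an output file name.
// OutputPathSketch and its methods are hypothetical, not part of Hadoop.
final class OutputPathSketch {

    // Builds the base path passed to write(), e.g. "029070-99999/1901/part".
    static String basePath(String stationId, String year) {
        return String.format("%s/%s/part", stationId, year);
    }

    // Appends the reducer suffix "-r-nnnnn" for the given partition number.
    static String reducerFile(String basePath, int partition) {
        return String.format("%s-r-%05d", basePath, partition);
    }

    public static void main(String[] args) {
        String base = basePath("029070-99999", "1901");
        System.out.println(reducerFile(base, 0)); // 029070-99999/1901/part-r-00000
    }
}
```

The resulting name is relative to the job's output directory, so each extra path separator in the base path adds one level of subdirectory.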

MultipleOutputs delegates to the mapper's OutputFormat. In this example it's a TextOutputFormat, but more complex setups are possible. For example, you can create named outputs, each with its own OutputFormat and key and value types (which may differ from the output types of the mapper or reducer). Furthermore, the mapper or reducer (or both) may write to multiple output files for each record processed. Consult the Java documentation for more information.

Lazy Output

FileOutputFormat subclasses create output files (part-r-nnnnn) even if they are empty. Some applications prefer that empty files not be created, which is where LazyOutputFormat helps. It is a wrapper output format that ensures an output file is created only when the first record is emitted for a given partition. To use it, call its setOutputFormatClass() method with the job (a Job in the new API, a JobConf in the old one) and the underlying output format.

Streaming supports a -lazyOutput option to enable LazyOutputFormat.
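The idea behind lazy output can be sketched in plain Java, independent of Hadoop: a wrapper defers opening the file until the first record is written, so a partition that emits nothing leaves no file behind. LazyFileWriter below is a hypothetical class written for illustration under that assumption. It is not how LazyOutputFormat is implemented, only a small model of the same behavior.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative model only: LazyFileWriter is hypothetical, not Hadoop's LazyOutputFormat.
final class LazyFileWriter implements AutoCloseable {
    private final Path path;
    private BufferedWriter writer; // stays null until the first record arrives

    LazyFileWriter(Path path) {
        this.path = path;
    }

    // Opens (and thereby creates) the file on the first call, then writes one record per line.
    void write(String record) throws IOException {
        if (writer == null) {
            writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8);
        }
        writer.write(record);
        writer.newLine();
    }

    // Closes the underlying file only if it was ever opened.
    @Override
    public void close() throws IOException {
        if (writer != null) {
            writer.close();
        }
    }
}
```

Opening and closing a LazyFileWriter without calling write() creates no file, which is the effect that applications avoiding empty part files are after.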

Database Output

The output formats for writing to relational databases and to HBase are mentioned in Database Input (and Output).

[55] But see the classes in org.apache.hadoop.mapred for the old MapReduce API counterparts.

[56] This is how the mapper in SortValidator.RecordStatsChecker is implemented.
