The base path specified in the write() method of MultipleOutputs is interpreted relative to the output directory, and because it may contain file path separator characters (/), it's possible to create subdirectories of arbitrary depth. For example, the following modification partitions the data by station and year so that each year's data is contained in a directory named by the station ID (such as 029070-99999/1901/part-r-00000):

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        parser.parse(value);
        // Base path is <stationId>/<year>/part, relative to the job output directory
        String basePath = String.format("%s/%s/part",
            parser.getStationId(), parser.getYear());
        multipleOutputs.write(NullWritable.get(), value, basePath);
      }
    }

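To make the naming concrete, here is a minimal plain-Java sketch of how a base path turns into a final file name. It does not use any Hadoop classes. The class and method names (BasePathSketch, basePath, outputFile) are hypothetical, and the suffix rule (a task-type letter followed by a five-digit, zero-padded partition number) is an assumption inferred from the part-r-00000 example above.

```java
// Hypothetical helper, not part of Hadoop: it only mimics the file naming
// suggested by the example path 029070-99999/1901/part-r-00000.
class BasePathSketch {

    // Builds the base path passed to write(): <stationId>/<year>/part
    static String basePath(String stationId, String year) {
        return String.format("%s/%s/part", stationId, year);
    }

    // Appends the assumed suffix: -<taskType>-<five-digit partition number>
    static String outputFile(String basePath, char taskType, int partition) {
        return String.format("%s-%c-%05d", basePath, taskType, partition);
    }

    public static void main(String[] args) {
        String base = basePath("029070-99999", "1901");
        // Path of the file relative to the job's output directory
        System.out.println(outputFile(base, 'r', 0));
    }
}
```

Because the separator is part of the base path string itself, adding another level (for example, a month) would only require another %s/ segment in the format string.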
MultipleOutputs delegates to the job's configured OutputFormat. In this example it's a TextOutputFormat, but more complex setups are possible. For example, you can create named outputs, each with its own OutputFormat and key and value types (which may differ from the output types of the mapper or reducer). Furthermore, the mapper or reducer (or both) may write to multiple output files for each record processed. Consult the Java documentation for more information.
Lazy Output
FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they are empty. Some applications prefer that empty files not be created, which is where LazyOutputFormat helps. It is a wrapper output format that ensures that the output file is created only when the first record is emitted for a given partition. To use it, call its setOutputFormatClass() method with the JobConf (or the Job, in the new MapReduce API) and the underlying output format.
Streaming supports a -lazyOutput option to enable LazyOutputFormat.
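The deferred-creation idea can be illustrated without Hadoop. The following is a plain-Java sketch of the mechanism only, not LazyOutputFormat's actual implementation: a wrapper that opens its file on the first write, so a partition that emits no records leaves no file behind. The class name LazyFileWriterSketch and the helper createsFile are hypothetical. In a real job using the new API, the corresponding driver call would be LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of lazy output: the file is opened on first write.
class LazyFileWriterSketch implements AutoCloseable {
    private final Path path;
    private Writer writer; // stays null until the first record arrives

    LazyFileWriterSketch(Path path) {
        this.path = path;
    }

    void write(String record) throws IOException {
        if (writer == null) {
            // First record for this "partition": create the file now
            writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8);
        }
        writer.write(record);
        writer.write('\n');
    }

    @Override
    public void close() throws IOException {
        if (writer != null) {
            writer.close();
        }
    }

    // Writes the given number of records to a temporary part file and
    // reports whether the file was created; cleans up afterwards.
    static boolean createsFile(int records) {
        try {
            Path dir = Files.createTempDirectory("lazy-sketch");
            Path part = dir.resolve("part-r-00000");
            try (LazyFileWriterSketch out = new LazyFileWriterSketch(part)) {
                for (int i = 0; i < records; i++) {
                    out.write("record " + i);
                }
            }
            boolean created = Files.exists(part);
            Files.deleteIfExists(part);
            Files.delete(dir);
            return created;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling createsFile(0) leaves no part file, whereas createsFile(1) produces one. This mirrors the difference between a job that uses the wrapper and one that creates every part file eagerly.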
Database Output
The output formats for writing to relational databases and to HBase are mentioned in