FixedLengthInputFormat
FixedLengthInputFormat is for reading fixed-width binary records from a file,
when the records are not separated by delimiters. The record size must be set via
fixedlengthinputformat.record.length.
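To make the idea of delimiter-free, fixed-width records concrete, here is a minimal plain-Java sketch. It is not Hadoop's implementation; the class name FixedLengthRecords and the method split() are hypothetical names chosen for illustration, and rejecting a trailing partial record is an assumption of this sketch. It shows only that record boundaries are determined by the configured length alone.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a plain-Java stand-in, not Hadoop's FixedLengthInputFormat.
class FixedLengthRecords {

    // Cuts the input into consecutive records of exactly recordLength bytes.
    // No delimiter is consulted; boundaries depend only on the record length.
    static List<byte[]> split(byte[] data, int recordLength) {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("record length must be positive");
        }
        // Assumption of this sketch: a trailing partial record is an error.
        if (data.length % recordLength != 0) {
            throw new IllegalArgumentException("input ends with a partial record");
        }
        List<byte[]> records = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += recordLength) {
            records.add(Arrays.copyOfRange(data, offset, offset + recordLength));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "AAAABBBBCCCC".getBytes(StandardCharsets.US_ASCII);
        // Three 4-byte records: AAAA, BBBB, CCCC
        for (byte[] record : split(data, 4)) {
            System.out.println(new String(record, StandardCharsets.US_ASCII));
        }
    }
}
```

Because nothing in the data marks where a record ends, the reader has to be told the length up front, which is why the record length setting is mandatory.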
Multiple Inputs
Although the input to a MapReduce job may consist of multiple input files (constructed
by a combination of file globs, filters, and plain paths), all of the input is interpreted by a
single InputFormat and a single Mapper. What often happens, however, is that the
data format evolves over time, so you have to write your mapper to cope with all of your
legacy formats. Or you may have data sources that provide the same type of data but in
different formats. This arises in the case of performing joins of different datasets; see
Reduce-Side Joins. For instance, one might be tab-separated plain text, and the other a
binary sequence file. Even if they are in the same format, they may have different
representations, and therefore need to be parsed differently.
These cases are handled elegantly by using the MultipleInputs class, which allows
you to specify which InputFormat and Mapper to use on a per-path basis. For example,
to combine the Met Office data with the NCDC data for our maximum temperature analysis,
we might set up the input as follows:
MultipleInputs.addInputPath(job, ncdcInputPath,
    TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
    TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);
This code replaces the usual calls to FileInputFormat.addInputPath() and
job.setMapperClass(). Both the Met Office and NCDC data are text-based, so we
use TextInputFormat for each. But the line format of the two data sources is different,
so we use two different mappers. The MaxTemperatureMapper reads NCDC input data
and extracts the year and temperature fields. The MetOfficeMaxTemperatureMapper
reads Met Office input data and extracts the year and temperature fields.
The important thing is that the map outputs have the same types, since the reducers
(which are all of the same type) see the aggregated map outputs and are not aware of the
different mappers used to produce them.
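The requirement that every mapper emit the same output types can be sketched in plain Java, without Hadoop. Everything in this sketch is hypothetical: the class MultipleInputsSketch, the YearTemp pair, and the two line formats (tab-separated "year, temperature" and comma-separated "temperature, year") are invented for illustration and are not the real NCDC or Met Office formats. Two format-specific parsers stand in for the two mappers, and a single max-per-year step stands in for the reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain-Java stand-in for per-path mappers feeding one reducer.
class MultipleInputsSketch {

    // The shared "map output" type: every parser must produce this.
    static final class YearTemp {
        final int year;
        final int temp;
        YearTemp(int year, int temp) { this.year = year; this.temp = temp; }
    }

    // Hypothetical format A: "year<TAB>temperature"
    static YearTemp parseTabSeparated(String line) {
        String[] fields = line.split("\t");
        return new YearTemp(Integer.parseInt(fields[0]), Integer.parseInt(fields[1]));
    }

    // Hypothetical format B: "temperature,year"
    static YearTemp parseCommaSeparated(String line) {
        String[] fields = line.split(",");
        return new YearTemp(Integer.parseInt(fields[1]), Integer.parseInt(fields[0]));
    }

    // The "reducer": it sees only YearTemp pairs and cannot tell which parser made them.
    static Map<Integer, Integer> maxByYear(List<YearTemp> pairs) {
        Map<Integer, Integer> max = new TreeMap<>();
        for (YearTemp p : pairs) {
            max.merge(p.year, p.temp, Integer::max);
        }
        return max;
    }

    // Applies a different parser to each source, then reduces the combined output.
    static Map<Integer, Integer> run(List<String> sourceA, List<String> sourceB) {
        List<YearTemp> mapOutput = new ArrayList<>();
        for (String line : sourceA) mapOutput.add(parseTabSeparated(line));
        for (String line : sourceB) mapOutput.add(parseCommaSeparated(line));
        return maxByYear(mapOutput);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> result = run(
            List.of("1950\t22", "1949\t111"),
            List.of("78,1949", "30,1950"));
        System.out.println(result);
    }
}
```

The parsing logic differs per source, but because both parsers return YearTemp, the reducing step needs no knowledge of where a pair came from. That is the same constraint the mappers in a MultipleInputs job must satisfy.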
The MultipleInputs class has an overloaded version of addInputPath() that
doesn't take a mapper: