FixedLengthInputFormat
FixedLengthInputFormat is for reading fixed-width binary records from a file
when the records are not separated by delimiters. The record size must be set via the
fixedlengthinputformat.record.length property.
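To make the idea of delimiter-free, fixed-width records concrete, here is a minimal plain-Java sketch that uses no Hadoop classes. The class name FixedLengthRecords and the method splitRecords are hypothetical, and rejecting a trailing partial record is a choice made for this sketch, not a statement about how FixedLengthInputFormat itself behaves.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the Hadoop API.
class FixedLengthRecords {

    // Splits data into consecutive records of exactly recordLength bytes.
    // There are no delimiters: record boundaries are determined by offset alone.
    static List<byte[]> splitRecords(byte[] data, int recordLength) {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("record length must be positive");
        }
        // Assumption for this sketch: a trailing partial record is an error.
        if (data.length % recordLength != 0) {
            throw new IllegalArgumentException("input ends with a partial record");
        }
        List<byte[]> records = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += recordLength) {
            records.add(Arrays.copyOfRange(data, offset, offset + recordLength));
        }
        return records;
    }

    public static void main(String[] args) {
        // Three 4-byte records packed back to back.
        byte[] data = "AAAABBBBCCCC".getBytes(StandardCharsets.US_ASCII);
        for (byte[] record : splitRecords(data, 4)) {
            System.out.println(new String(record, StandardCharsets.US_ASCII));
        }
    }
}
```

Because boundaries depend only on the record length, a wrong length silently misaligns every record after the first, which is presumably why the property has to be set explicitly.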
Multiple Inputs
Although the input to a MapReduce job may consist of multiple input files (constructed
by a combination of file globs, filters, and plain paths), all of the input is interpreted by a
single InputFormat and a single Mapper. What often happens, however, is that the
data format evolves over time, so you have to write your mapper to cope with all of your
legacy formats. Or you may have data sources that provide the same type of data but in
different formats. This arises in the case of performing joins of different datasets; see
Reduce-Side Joins. For instance, one might be tab-separated plain text, and the other a
binary sequence file. Even if they are in the same format, they may have different
representations, and therefore need to be parsed differently.
These cases are handled elegantly by using the MultipleInputs class, which allows
you to specify which InputFormat and Mapper to use on a per-path basis. For example,
if we had weather data from the UK Met Office [59] that we wanted to combine
with the NCDC data for our maximum temperature analysis, we might set up the input as
follows:
MultipleInputs.addInputPath(job, ncdcInputPath,
    TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
    TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);
This code replaces the usual calls to FileInputFormat.addInputPath() and
job.setMapperClass(). Both the Met Office and NCDC data are text based, so we
use TextInputFormat for each. But the line format of the two data sources is different,
so we use two different mappers. The MaxTemperatureMapper reads NCDC input
data and extracts the year and temperature fields. The MetOfficeMaxTemperatureMapper
reads Met Office input data and extracts the year and temperature fields.
The important thing is that the map outputs have the same types, since the reducers
(which are all of the same type) see the aggregated map outputs and are not aware of the
different mappers used to produce them.
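The requirement that every mapper emit the same output types can be sketched in plain Java, without Hadoop. Everything here is hypothetical: the class MultiFormatMaxTemperature, its methods, and both line layouts (a tab-separated "year, temperature" line and a comma-separated "station, year, temperature" line) are invented for illustration and are not the real NCDC or Met Office formats.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical illustration only; line layouts are invented, not the real data formats.
class MultiFormatMaxTemperature {

    // "Mapper" for the invented tab-separated layout: year<TAB>temperature
    static Map.Entry<String, Integer> parseTabSeparated(String line) {
        String[] fields = line.split("\t");
        return new AbstractMap.SimpleImmutableEntry<>(fields[0], Integer.parseInt(fields[1].trim()));
    }

    // "Mapper" for the invented comma-separated layout: station,year,temperature
    static Map.Entry<String, Integer> parseCommaSeparated(String line) {
        String[] fields = line.split(",");
        return new AbstractMap.SimpleImmutableEntry<>(fields[1], Integer.parseInt(fields[2].trim()));
    }

    // Applies one source's parser to its lines and appends to the shared output.
    static void mapAll(List<String> lines,
                       Function<String, Map.Entry<String, Integer>> mapper,
                       List<Map.Entry<String, Integer>> out) {
        for (String line : lines) {
            out.add(mapper.apply(line));
        }
    }

    // "Reducer": sees only (year, temperature) pairs and cannot tell which parser made them.
    static Map<String, Integer> maxByYear(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> max = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            max.merge(pair.getKey(), pair.getValue(), Math::max);
        }
        return max;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapAll(Arrays.asList("1949\t111", "1950\t22"),
               MultiFormatMaxTemperature::parseTabSeparated, mapOutput);
        mapAll(Arrays.asList("X1,1950,31", "X2,1949,78"),
               MultiFormatMaxTemperature::parseCommaSeparated, mapOutput);
        System.out.println(maxByYear(mapOutput)); // prints {1949=111, 1950=31}
    }
}
```

Pairing each source with its own parser mirrors the per-path pairing that MultipleInputs sets up, and the single maxByYear step only works because both parsers agree on the key and value types.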
The MultipleInputs class has an overloaded version of addInputPath() that
doesn't take a mapper: