Input Formats
Hadoop can process many different types of data formats, from flat text files to databases.
In this section, we explore the different formats available.
Input Splits and Records
As we saw in Chapter 2, an input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record (a key-value pair) in turn. Splits and records are logical: there is nothing that requires them to be tied to files, for example, although in their most common incarnations, they are. In a database context, a split might correspond to a range of rows from a table and a record to a row in that range (this is precisely the case with DBInputFormat, which is an input format for reading data from a relational database).
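To make the split/record distinction concrete, here is a hypothetical stand-alone sketch (not part of the Hadoop or DBInputFormat API; the class and field names are invented for illustration) of a database-style split: a logical range of rows, divided into per-row key-value records.

```java
import java.util.*;

// Hypothetical illustration (not the Hadoop API): a "split" modeled as a
// logical range of database rows. The split itself holds no data, only
// the boundaries of the range; it is divided into records, each a
// key-value pair of row id and row contents.
public class RowRangeSplit {
    final long startRow, endRow;   // half-open range [startRow, endRow)

    RowRangeSplit(long startRow, long endRow) {
        this.startRow = startRow;
        this.endRow = endRow;
    }

    // Each record is a key-value pair: here the key is the row id and
    // the value is a placeholder for the row's contents.
    List<Map.Entry<Long, String>> records() {
        List<Map.Entry<Long, String>> recs = new ArrayList<>();
        for (long row = startRow; row < endRow; row++) {
            recs.add(new AbstractMap.SimpleEntry<>(row, "row-" + row));
        }
        return recs;
    }

    public static void main(String[] args) {
        RowRangeSplit split = new RowRangeSplit(100, 103);
        for (Map.Entry<Long, String> rec : split.records()) {
            System.out.println(rec.getKey() + "=" + rec.getValue());
        }
    }
}
```

Note that the split stores only the range boundaries, mirroring the point below that a split is a reference to the data rather than the data itself.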
Input splits are represented by the Java class InputSplit (which, like all of the classes mentioned in this section, is in the org.apache.hadoop.mapreduce package):
public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException, InterruptedException;
}
An InputSplit has a length in bytes and a set of storage locations, which are just hostname strings. Notice that a split doesn't contain the input data; it is just a reference to the data. The storage locations are used by the MapReduce system to place map tasks as close to the split's data as possible, and the size is used to order the splits so that the largest get processed first, in an attempt to minimize the job runtime (this is an instance of a greedy approximation algorithm).
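The largest-first ordering can be sketched with a simplified stand-alone analogue of the split contract (the SimpleSplit class and orderBySize method are invented for illustration, not the real org.apache.hadoop.mapreduce classes):

```java
import java.util.*;

// Simplified stand-alone analogue of the InputSplit contract: a split
// records its length in bytes and the hostnames holding its data.
class SimpleSplit {
    final long length;          // size in bytes, used for ordering
    final String[] locations;   // hostnames, used for task placement

    SimpleSplit(long length, String... locations) {
        this.length = length;
        this.locations = locations;
    }
}

public class SplitOrdering {
    // Greedy largest-first ordering: schedule the biggest splits first,
    // in an attempt to minimize the overall job runtime.
    static List<SimpleSplit> orderBySize(List<SimpleSplit> splits) {
        List<SimpleSplit> ordered = new ArrayList<>(splits);
        ordered.sort((a, b) -> Long.compare(b.length, a.length));
        return ordered;
    }

    public static void main(String[] args) {
        List<SimpleSplit> splits = Arrays.asList(
            new SimpleSplit(64L, "host1"),
            new SimpleSplit(128L, "host2", "host3"),
            new SimpleSplit(32L, "host1"));
        List<SimpleSplit> ordered = orderBySize(splits);
        // The largest split (128 bytes) is scheduled first
        System.out.println(ordered.get(0).length);
    }
}
```

The greedy heuristic is why the length matters even though a split carries no data: the scheduler only needs the sizes and locations to make placement decisions.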
As a MapReduce application writer, you don't need to deal with InputSplits directly, as they are created by an InputFormat (an InputFormat is responsible for creating the input splits and dividing them into records). Before we see some concrete examples of InputFormats, let's briefly examine how it is used in MapReduce. Here's the interface:
public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext