Input Formats
Hadoop can process many different types of data formats, from flat text files to databases.
In this section, we explore the different formats available.
Input Splits and Records
As we saw in Chapter 2, an input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record (a key-value pair) in turn. Splits and records are logical: there is nothing that requires them to be tied to files, for example, although in their most common incarnations, they are. In a database context, a split might correspond to a range of rows from a table and a record to a row in that range (this is precisely the case with DBInputFormat, which is an input format for reading data from a relational database).
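To make the database example concrete, here is a simplified, Hadoop-free sketch of how a table might be carved into contiguous row-range splits. DBInputFormat divides work along these lines internally, but RowRangeSplitter, RowRange, and computeSplits below are hypothetical names invented for illustration:

```java
// Hypothetical sketch: divide a table of rowCount rows into nSplits
// contiguous row ranges, the way a database-backed input format might.
public class RowRangeSplitter {

    // A start (inclusive) / end (exclusive) row range acting as a "split".
    public record RowRange(long start, long end) {}

    public static java.util.List<RowRange> computeSplits(long rowCount, int nSplits) {
        java.util.List<RowRange> splits = new java.util.ArrayList<>();
        long chunk = rowCount / nSplits;
        long remainder = rowCount % nSplits;
        long start = 0;
        for (int i = 0; i < nSplits; i++) {
            // Spread the remainder over the first few splits so sizes
            // differ by at most one row.
            long size = chunk + (i < remainder ? 1 : 0);
            splits.add(new RowRange(start, start + size));
            start += size;
        }
        return splits;
    }
}
```

Each resulting range would then be read back as rows (the records) by a query such as a SELECT with a LIMIT/OFFSET over that range.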
Input splits are represented by the Java class InputSplit (which, like all of the classes mentioned in this section, is in the org.apache.hadoop.mapreduce package): [55]

public abstract class InputSplit {
    public abstract long getLength() throws IOException, InterruptedException;
    public abstract String[] getLocations() throws IOException, InterruptedException;
}
An InputSplit has a length in bytes and a set of storage locations, which are just hostname strings. Notice that a split doesn't contain the input data; it is just a reference to the data. The storage locations are used by the MapReduce system to place map tasks as close to the split's data as possible, and the size is used to order the splits so that the largest get processed first, in an attempt to minimize the job runtime (this is an instance of a greedy approximation algorithm).
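The largest-first ordering amounts to sorting splits by their length in descending order. A minimal sketch of that step (SplitOrdering and SimpleSplit are hypothetical stand-ins, not part of Hadoop's API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SplitOrdering {

    // Hypothetical stand-in for InputSplit: a name and a length in bytes.
    public record SimpleSplit(String name, long length) {}

    // Sort splits so the largest is scheduled first -- the greedy
    // heuristic described in the text for minimizing job runtime.
    public static List<SimpleSplit> largestFirst(List<SimpleSplit> splits) {
        List<SimpleSplit> ordered = new ArrayList<>(splits);
        ordered.sort(Comparator.comparingLong(SimpleSplit::length).reversed());
        return ordered;
    }
}
```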
As a MapReduce application writer, you don't need to deal with InputSplits directly, as they are created by an InputFormat (an InputFormat is responsible for creating the input splits and dividing them into records). Before we see some concrete examples of InputFormats, let's briefly examine how an InputFormat is used in MapReduce. Here's the interface:
public abstract class InputFormat<K, V> {
    public abstract List<InputSplit> getSplits(JobContext context)
        throws IOException, InterruptedException;
    public abstract RecordReader<K, V>
        createRecordReader(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException;
}
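To see how the two responsibilities fit together, here is a simplified, Hadoop-free sketch of the framework's side of the contract: ask the input format for its splits, then for each split create a record reader and pass every key-value record to the map function in turn. All names here (MiniInputFormat, MiniMapRunner, LineWordsFormat) are hypothetical, invented only to mirror the shape of the real API:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

public class MiniMapRunner {

    // Simplified analogue of the InputFormat contract: produce splits,
    // then a per-split reader of key-value records.
    interface MiniInputFormat<K, V> {
        List<String> getSplits();  // splits as opaque descriptors
        Iterator<Map.Entry<K, V>> createRecordReader(String split);
    }

    // A toy input format: each "split" is one line of text; records are
    // (byte offset within the line, word) pairs.
    static class LineWordsFormat implements MiniInputFormat<Integer, String> {
        private final List<String> lines;

        LineWordsFormat(List<String> lines) { this.lines = lines; }

        public List<String> getSplits() { return lines; }

        public Iterator<Map.Entry<Integer, String>> createRecordReader(String split) {
            List<Map.Entry<Integer, String>> records = new ArrayList<>();
            int offset = 0;
            for (String word : split.split("\\s+")) {
                records.add(Map.entry(offset, word));
                offset += word.length() + 1;  // +1 for the separator
            }
            return records.iterator();
        }
    }

    // The framework side: conceptually one "map task" per split, with
    // each record handed to the map function in turn.
    static <K, V> void run(MiniInputFormat<K, V> format, BiConsumer<K, V> map) {
        for (String split : format.getSplits()) {
            Iterator<Map.Entry<K, V>> reader = format.createRecordReader(split);
            while (reader.hasNext()) {
                Map.Entry<K, V> record = reader.next();
                map.accept(record.getKey(), record.getValue());
            }
        }
    }
}
```

The real framework adds scheduling, locality, and fault tolerance on top of this loop, but the division of labor is the same: the input format decides what the splits and records are, and the framework decides where and when each split is processed.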