MapReduce Types and Formats - Hadoop: The Definitive Guide

put data), you ensure that the record reader will skip the (long) corrupt lines without the

task failing.

KeyValueTextInputFormat

TextInputFormat's keys, being simply the offsets within the file, are not normally very useful. It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character. For example, this is the kind of output produced by TextOutputFormat, Hadoop's default OutputFormat. To interpret such files correctly, KeyValueTextInputFormat is appropriate.

You can specify the separator via the mapreduce.input.keyvaluelinerecordreader.key.value.separator property. It is a tab character by default. Consider the following input file, where → represents a (horizontal) tab character:

line1→On the top of the Crumpetty Tree

line2→The Quangle Wangle sat,

line3→But his face you could not see,

line4→On account of his Beaver Hat.

As in the TextInputFormat case, the input is in a single split comprising four records, although this time the keys are the Text sequences before the tab in each line:

(line1, On the top of the Crumpetty Tree)

(line2, The Quangle Wangle sat,)

(line3, But his face you could not see,)

(line4, On account of his Beaver Hat.)
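The splitting rule illustrated above can be sketched in plain Java, without any Hadoop dependency. This is only an approximation of what the record reader does, not Hadoop's own code: the class name KeyValueLineSketch and the method split are hypothetical, and the sketch assumes a single-character separator and that a line with no separator becomes a key with an empty value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: mimics splitting a line
// into a key and a value at the first separator character.
class KeyValueLineSketch {

    // Returns (text before the first separator, text after it).
    // Assumption: if the separator is absent, the whole line is the key
    // and the value is the empty string.
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        String[] lines = {
            "line1\tOn the top of the Crumpetty Tree",
            "line2\tThe Quangle Wangle sat,",
            "line3\tBut his face you could not see,",
            "line4\tOn account of his Beaver Hat."
        };
        // Print each record in the (key, value) form used in the text.
        for (String line : lines) {
            Map.Entry<String, String> kv = split(line, '\t');
            System.out.println("(" + kv.getKey() + ", " + kv.getValue() + ")");
        }
    }
}
```

Only the first separator matters in this sketch, so any further tab characters in a line stay in the value.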

NLineInputFormat

With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input. The number depends on the size of the split and the length of the lines. If you want your mappers to receive a fixed number of lines of input, then NLineInputFormat is the InputFormat to use. As with TextInputFormat, the keys are the byte offsets within the file and the values are the lines themselves.
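To make the idea of a fixed number of lines per mapper concrete, here is a minimal plain-Java sketch that groups lines into batches and gives each line its byte offset as the key. It illustrates the grouping only and is not Hadoop's implementation: the names NLineSplitSketch, Record, and linesPerMap are hypothetical, and the offsets assume UTF-8 text with a single '\n' terminating every line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of Hadoop: groups lines into
// batches of a fixed size, as if each batch went to one mapper.
class NLineSplitSketch {

    // One record as a mapper would see it: byte offset of the line, and the line.
    static final class Record {
        final long offset;
        final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    // Splits the lines into groups of linesPerMap records each.
    // The last group may be shorter if the line count is not a multiple.
    static List<List<Record>> split(List<String> lines, int linesPerMap) {
        if (linesPerMap < 1) {
            throw new IllegalArgumentException("linesPerMap must be at least 1");
        }
        List<List<Record>> splits = new ArrayList<>();
        List<Record> current = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            current.add(new Record(offset, line));
            // Assumption: UTF-8 bytes plus one byte for the '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            if (current.size() == linesPerMap) {
                splits.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "On the top of the Crumpetty Tree",
            "The Quangle Wangle sat,",
            "But his face you could not see,",
            "On account of his Beaver Hat.");
        int mapper = 0;
        for (List<Record> group : split(lines, 2)) {
            System.out.println("mapper " + mapper++ + ":");
            for (Record r : group) {
                System.out.println("  (" + r.offset + ", " + r.line + ")");
            }
        }
    }
}
```

With four input lines and linesPerMap set to 2, the sketch produces two groups of two records each; with linesPerMap set to 1, every line forms its own group.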

N refers to the number of lines of input that each mapper receives. With N set to 1 (the default), each mapper receives exactly one line of input. The mapreduce.input.lineinputformat.linespermap property controls the value of N. By way of example, consider these four lines again:
