Database Reference
In-Depth Information
put data), you ensure that the record reader will skip the (long) corrupt lines without the
task failing.
KeyValueTextInputFormat
TextInputFormat 's keys, being simply the offsets within the file, are not normally
very useful. It is common for each line in a file to be a key-value pair, separated by a de-
limiter such as a tab character. For example, this is the kind of output produced by Tex-
tOutputFormat , Hadoop's default OutputFormat . To interpret such files correctly,
KeyValueTextInputFormat is appropriate.
You can specify the separator via the mapre-
duce.input.keyvaluelinerecordreader.key.value.separator prop-
erty. It is a tab character by default. Consider the following input file, where → represents
a (horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
Like in the TextInputFormat case, the input is in a single split comprising four re-
cords, although this time the keys are the Text sequences before the tab in each line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
NLineInputFormat
With TextInputFormat and KeyValueTextInputFormat , each mapper receives
a variable number of lines of input. The number depends on the size of the split and the
length of the lines. If you want your mappers to receive a fixed number of lines of input,
then NLineInputFormat is the InputFormat to use. Like with TextIn-
putFormat , the keys are the byte offsets within the file and the values are the lines
themselves.
N refers to the number of lines of input that each mapper receives. With N set to 1 (the de-
fault), each mapper receives exactly one line of input. The
mapreduce.input.lineinputformat.linespermap property controls the
value of N. By way of example, consider these four lines again:
Search WWH ::




Custom Search