62 62 62 62 62
hdfs://localhost/user/tom/input/smallfiles/d    64 64 64 64 64 64 64 64 64 64
hdfs://localhost/user/tom/input/smallfiles/f    66 66 66 66 66 66 66 66 66 66
The input files were named a, b, c, d, e, and f, and each contained 10 characters of the corresponding letter (so, for example, a contained 10 "a" characters), except e, which was empty. We can see this in the textual rendering of the sequence files, which prints the filename followed by the hex representation of the file.
TIP
There's at least one way we could improve this program. As mentioned earlier, having one mapper per
file is inefficient, so subclassing CombineFileInputFormat instead of FileInputFormat
would be a better approach.
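As a rough illustration of the tip, the subclass might look like the following sketch. It is not the book's code: it assumes a WholeFileRecordReader like the one from the earlier whole-file example, adapted to the three-argument constructor (CombineFileSplit, TaskAttemptContext, Integer) that CombineFileRecordReader requires of its delegate.

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Packs many small files into each split, so one mapper handles
// several files instead of one mapper per file.
public class CombineWholeFileInputFormat
        extends CombineFileInputFormat<NullWritable, BytesWritable> {

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader delegates to one reader per file in the
        // combined split; WholeFileRecordReader (hypothetical adaptation of
        // the earlier example's reader) reads each file as a single record.
        return new CombineFileRecordReader<NullWritable, BytesWritable>(
                (CombineFileSplit) split, context, WholeFileRecordReader.class);
    }
}
```

The split size can then be bounded with setMaxSplitSize() so that each combined split stays near the HDFS block size.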
Text Input
Hadoop excels at processing unstructured text. In this section, we discuss the different
InputFormat s that Hadoop provides to process text.
TextInputFormat
TextInputFormat is the default InputFormat . Each record is a line of input. The
key, a LongWritable , is the byte offset within the file of the beginning of the line. The
value is the contents of the line, excluding any line terminators (e.g., newline or carriage
return), and is packaged as a Text object. So, a file containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
is divided into one split of four records. The records are interpreted as the following key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
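The offsets above can be reproduced outside Hadoop. The following minimal sketch (not part of the Hadoop API; the class and method names are invented for illustration) walks a string the way TextInputFormat walks a file: each record's key is the byte offset of the line's first byte, and its value is the line with the terminator stripped.

```java
import java.util.ArrayList;
import java.util.List;

public class TextInputFormatDemo {
    /**
     * Mimics TextInputFormat's record boundaries for a newline-terminated
     * file: returns one "offset<TAB>line" string per record.
     */
    static List<String> records(String file) {
        List<String> recs = new ArrayList<>();
        int start = 0;
        while (start < file.length()) {
            int nl = file.indexOf('\n', start);
            int end = (nl == -1) ? file.length() : nl;
            // Key is the byte offset of the line start; value excludes '\n'.
            recs.add(start + "\t" + file.substring(start, end));
            start = end + 1; // skip past the line terminator
        }
        return recs;
    }

    public static void main(String[] args) {
        String poem = "On the top of the Crumpetty Tree\n"
                    + "The Quangle Wangle sat,\n"
                    + "But his face you could not see,\n"
                    + "On account of his Beaver Hat.\n";
        for (String r : records(poem)) {
            System.out.println(r);
        }
    }
}
```

The first line is 32 characters plus a newline, so the second record starts at offset 33; adding 24 and 32 bytes for the next two lines gives 57 and 89, matching the pairs shown above.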
Clearly, the keys are not line numbers. This would be impossible to implement in general, because a file is broken into splits at byte, not line, boundaries. Splits are processed independently.