62 62 62 62 62
hdfs://localhost/user/tom/input/smallfiles/d    64 64 64 64 64 64 64 64 64 64
hdfs://localhost/user/tom/input/smallfiles/f    66 66 66 66 66 66 66 66 66 66
The input files were named a, b, c, d, e, and f, and each contained 10 characters of the corresponding letter (so, for example, a contained 10 "a" characters), except e, which was empty. We can see this in the textual rendering of the sequence files, which prints the filename followed by the hex representation of the file.
TIP
There's at least one way we could improve this program. As mentioned earlier, having one mapper per
file is inefficient, so subclassing CombineFileInputFormat instead of FileInputFormat
would be a better approach.
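As a rough illustration of the tip, the subclass might look like the following sketch. It is not the book's code: it assumes a WholeFileRecordReader like the one from the earlier whole-file example, adapted to the three-argument constructor (CombineFileSplit, TaskAttemptContext, Integer) that CombineFileRecordReader requires of its delegate.

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Packs many small files into each split, so one mapper handles
// several files instead of one mapper per file.
public class CombineWholeFileInputFormat
        extends CombineFileInputFormat<NullWritable, BytesWritable> {

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader delegates to one reader per file in the
        // combined split; WholeFileRecordReader (hypothetical adaptation of
        // the earlier example's reader) reads each file as a single record.
        return new CombineFileRecordReader<NullWritable, BytesWritable>(
                (CombineFileSplit) split, context, WholeFileRecordReader.class);
    }
}
```

The split size can then be bounded with setMaxSplitSize() so that each combined split stays near the HDFS block size.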
Text Input
Hadoop excels at processing unstructured text. In this section, we discuss the different
InputFormat s that Hadoop provides to process text.
TextInputFormat
TextInputFormat is the default InputFormat . Each record is a line of input. The
key, a LongWritable , is the byte offset within the file of the beginning of the line. The
value is the contents of the line, excluding any line terminators (e.g., newline or carriage
return), and is packaged as a Text object. So, a file containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
is divided into one split of four records. The records are interpreted as the following key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
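The offsets above can be reproduced outside Hadoop. The following minimal sketch (not part of the Hadoop API; the class and method names are invented for illustration) walks a string the way TextInputFormat walks a file: each record's key is the byte offset of the line's first byte, and its value is the line with the terminator stripped.

```java
import java.util.ArrayList;
import java.util.List;

public class TextInputFormatDemo {
    /**
     * Mimics TextInputFormat's record boundaries for a newline-terminated
     * file: returns one "offset<TAB>line" string per record.
     */
    static List<String> records(String file) {
        List<String> recs = new ArrayList<>();
        int start = 0;
        while (start < file.length()) {
            int nl = file.indexOf('\n', start);
            int end = (nl == -1) ? file.length() : nl;
            // Key is the byte offset of the line start; value excludes '\n'.
            recs.add(start + "\t" + file.substring(start, end));
            start = end + 1; // skip past the line terminator
        }
        return recs;
    }

    public static void main(String[] args) {
        String poem = "On the top of the Crumpetty Tree\n"
                    + "The Quangle Wangle sat,\n"
                    + "But his face you could not see,\n"
                    + "On account of his Beaver Hat.\n";
        for (String r : records(poem)) {
            System.out.println(r);
        }
    }
}
```

The first line is 32 characters plus a newline, so the second record starts at offset 33; adding 24 and 32 bytes for the next two lines gives 57 and 89, matching the pairs shown above.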
Clearly, the keys are not line numbers. This would be impossible to implement in general, because a file is broken into splits at byte, not line, boundaries. Splits are processed independently.