MapReduce Types and Formats - Hadoop: The Definitive Guide

pendently. Line numbers are really a sequential notion. You have to keep a count of lines

as you consume them, so knowing the line number within a split would be possible, but

not within the file.

However, the offset within the file of each line is known by each split independently of the other splits, since each split knows the size of the preceding splits and just adds this onto the offsets within the split to produce a global file offset. The offset is usually sufficient for applications that need a unique identifier for each line. Combined with the file's name, it is unique within the filesystem. Of course, if all the lines are a fixed width, calculating the line number is simply a matter of dividing the offset by the width.
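The arithmetic described above can be sketched in a few lines of plain Java. This is only an illustration, not Hadoop code: the class name LineOffsets, its two methods, and the sample numbers are all hypothetical, and the record width is assumed to include the line terminator.

```java
class LineOffsets {
    // Global file offset of a line: the split's starting offset in the file
    // plus the line's offset within that split.
    static long globalOffset(long splitStart, long offsetWithinSplit) {
        return splitStart + offsetWithinSplit;
    }

    // Zero-based line number for fixed-width lines. recordWidth is assumed
    // to count every byte of a line, including its terminator.
    static long lineNumber(long globalOffset, long recordWidth) {
        return globalOffset / recordWidth;
    }

    public static void main(String[] args) {
        long splitStart = 1_000_000L;   // hypothetical: total size of preceding splits
        long offsetWithinSplit = 2_500L; // hypothetical: position of the line in this split
        long recordWidth = 100L;         // hypothetical: fixed line width in bytes

        long offset = globalOffset(splitStart, offsetWithinSplit);
        System.out.println("global offset = " + offset);
        System.out.println("line number   = " + lineNumber(offset, recordWidth));
    }
}
```

Note that the division only yields a line number when every line really has the same width; with variable-length lines the offset remains a usable unique key, but the line number cannot be recovered from it.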

THE RELATIONSHIP BETWEEN INPUT SPLITS AND HDFS BLOCKS

The logical records that FileInputFormats define usually do not fit neatly into HDFS blocks. For example, a TextInputFormat's logical records are lines, which will cross HDFS block boundaries more often than not. This has no bearing on the functioning of your program (lines are not missed or broken, for example), but it's worth knowing about because it does mean that data-local maps (that is, maps that are running on the same host as their input data) will perform some remote reads. The slight overhead this causes is not normally significant.

Figure 8-3 shows an example. A single file is broken into lines, and the line boundaries do not correspond with the HDFS block boundaries. Splits honor logical record boundaries (in this case, lines), so we see that the first split contains line 5, even though it spans the first and second block. The second split starts at line 6.

Figure 8-3. Logical records and HDFS blocks for TextInputFormat
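To make the boundary handling concrete, here is a simplified, self-contained simulation of how a line-oriented reader can honor record boundaries. It is a sketch under stated assumptions rather than Hadoop's actual record reader: the class SplitLineReader and its methods are hypothetical, the "file" is an in-memory byte array, and lines are assumed to end in a single newline byte. The rule it applies is that a split that does not start the file discards its first (possibly partial) line, and every split keeps reading until it has finished the last line that begins at or before its end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitLineReader {
    // Returns the index just past the next '\n' at or after pos,
    // or file.length if there is no further newline.
    static int nextLineStart(byte[] file, int pos) {
        while (pos < file.length) {
            if (file[pos++] == '\n') {
                return pos;
            }
        }
        return file.length;
    }

    // Returns the lines that belong to the split [start, start + length).
    static List<String> readSplit(byte[] file, int start, int length) {
        int end = start + length;
        int pos = start;
        // Any split other than the first skips its first line: the reader
        // of the preceding split is responsible for consuming it.
        if (start != 0) {
            pos = nextLineStart(file, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line belongs to this split if it begins at or before the split's
        // end, even when its bytes run on into the next split.
        while (pos <= end && pos < file.length) {
            int next = nextLineStart(file, pos);
            boolean terminated = file[next - 1] == '\n';
            int lineEnd = terminated ? next - 1 : next;
            lines.add(new String(file, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        // Four 6-byte lines (24 bytes) cut into hypothetical 10-byte splits.
        byte[] file = "line1\nline2\nline3\nline4\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(file, 0, 10));
        System.out.println(readSplit(file, 10, 10));
        System.out.println(readSplit(file, 20, 4));
    }
}
```

In this example the second line straddles the boundary at byte 10, yet it is returned whole by the first split and skipped by the second, so no line is lost or read twice. Reading past the end of a split is what corresponds to the remote reads mentioned above when the remainder of the line lives in a block on another host.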

Controlling the maximum line length

If you are using one of the text input formats discussed here, you can set a maximum expected line length to safeguard against corrupted files. Corruption in a file can manifest itself as a very long line, which can cause out-of-memory errors and then task failure. By setting mapreduce.input.linerecordreader.line.maxlength to a value in bytes that fits in memory (and is comfortably greater than the length of lines in your in-
