pendently. Line numbers are really a sequential notion. You have to keep a count of lines
as you consume them, so knowing the line number within a split would be possible, but
not within the file.
However, the offset within the file of each line is known by each split independently of
the other splits, since each split knows the size of the preceding splits and just adds this
onto the offsets within the split to produce a global file offset. The offset is usually
sufficient for applications that need a unique identifier for each line. Combined with the
file's name, it is unique within the filesystem. Of course, if all the lines are a fixed width,
calculating the line number is simply a matter of dividing the offset by the width.
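The arithmetic for the fixed-width case can be sketched as follows. This is a minimal illustration, not Hadoop code: the class name, the method names, and the 10-byte line width are assumptions made for the example, and the width is taken to include the line terminator.

```java
// Sketch only: FixedWidthLines and its methods are hypothetical helpers.
final class FixedWidthLines {

    // Global offset = total size of the preceding splits + offset within this split.
    static long globalOffset(long splitStart, long offsetInSplit) {
        return splitStart + offsetInSplit;
    }

    // Zero-based line number for a line that starts at globalOffset.
    // lineWidth is the full record width in bytes, terminator included (assumed).
    static long lineNumber(long globalOffset, int lineWidth) {
        if (lineWidth <= 0) {
            throw new IllegalArgumentException("lineWidth must be positive");
        }
        if (globalOffset % lineWidth != 0) {
            throw new IllegalArgumentException("offset is not the start of a line");
        }
        return globalOffset / lineWidth;
    }

    public static void main(String[] args) {
        int lineWidth = 10;      // assumed: 9 data bytes plus '\n'
        long splitStart = 1000;  // assumed: this split begins at byte 1000 of the file
        long offset = globalOffset(splitStart, 30);
        System.out.println(lineNumber(offset, lineWidth)); // prints 103
    }
}
```

The line number here is zero-based; add one if the application counts lines from 1.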
THE RELATIONSHIP BETWEEN INPUT SPLITS AND HDFS BLOCKS
The logical records that FileInputFormats define usually do not fit neatly into HDFS blocks. For
example, a TextInputFormat's logical records are lines, which will cross HDFS block boundaries more
often than not. This has no bearing on the functioning of your program (lines are not missed or
broken, for example), but it's worth knowing about because it does mean that data-local maps (that
is, maps that are running on the same host as their input data) will perform some remote reads. The
slight overhead this causes is not normally significant.
Figure 8-3 shows an example. A single file is broken into lines, and the line boundaries do not
correspond with the HDFS block boundaries. Splits honor logical record boundaries (in this case,
lines), so we see that the first split contains line 5, even though it spans the first and second
block. The second split starts at line 6.
Figure 8-3. Logical records and HDFS blocks for TextInputFormat
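One common way to make splits honor line boundaries can be sketched with a small in-memory model. The rule assumed here (it is not spelled out in the text above) is that a reader whose split does not start at byte 0 discards everything up to and including the first newline, and every reader keeps reading any line that starts at or before the end of its split, even if that line runs past the end. The class and method names are hypothetical, and a byte array stands in for the file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch only: a simplified model of split-aware line reading, not Hadoop's implementation.
final class SplitLineSketch {

    // Returns the lines owned by the split covering bytes [start, start + length).
    static List<String> readSplit(byte[] file, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Skip the first (possibly partial) line; the previous split reads it.
            pos = nextLineStart(file, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line that starts at or before 'end' belongs to this split,
        // even when its bytes extend into the next split (a "remote read").
        while (pos <= end && pos < file.length) {
            int next = nextLineStart(file, pos);
            boolean terminated = next > pos && file[next - 1] == '\n';
            int lineEnd = terminated ? next - 1 : next;
            lines.add(new String(file, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    // Index just past the next '\n' at or after 'from', or file.length if there is none.
    static int nextLineStart(byte[] file, int from) {
        int i = from;
        while (i < file.length && file[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, file.length);
    }

    public static void main(String[] args) {
        byte[] file = "aaa\nbbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 6 falls in the middle of "bbbbb".
        System.out.println(readSplit(file, 0, 6)); // prints [aaa, bbbbb]
        System.out.println(readSplit(file, 6, 7)); // prints [cc]
    }
}
```

Under this rule every line is read exactly once: the first reader finishes the line that straddles the boundary, and the second reader starts at the next full line.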
Controlling the maximum line length
If you are using one of the text input formats discussed here, you can set a maximum
expected line length to safeguard against corrupted files. Corruption in a file can manifest
itself as a very long line, which can cause out-of-memory errors and then task failure. By
setting mapreduce.input.linerecordreader.line.maxlength to a value
in bytes that fits in memory (and is comfortably greater than the length of lines in your in-