representing the split and block boundaries (in this case, the split and block
size are the same).
When files (especially text files) are compressed, complications arise. For most compression algorithms, individual file splits cannot be decompressed independently of other splits from the same file. More specifically, these compression algorithms are not splittable (remember this key term when discussing compression and Hadoop). In the current production release of Hadoop (1.0.3 at the time of writing), no support is provided for splitting compressed text files. For files stored in the Sequence or Avro formats, this is not an issue, because these formats have built-in synchronization points and are therefore splittable. For unsplittable compressed text files, MapReduce processing is limited to a single mapper.
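If you want to check ahead of time whether Hadoop will treat a particular compressed file as splittable, you can ask the codec framework directly. The following sketch is illustrative only: it assumes a Hadoop release that ships the SplittableCompressionCodec marker interface (later releases do; the 1.0.3 release discussed here does not split compressed text regardless), and the class name and command-line argument are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        // Path to inspect; passed on the command line for this sketch
        Path file = new Path(args[0]);

        // The factory maps file extensions (.gz, .bz2, and so on) to registered codecs
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        CompressionCodec codec = factory.getCodec(file);

        if (codec == null) {
            System.out.println(file + ": uncompressed - each split can be read independently");
        } else if (codec instanceof SplittableCompressionCodec) {
            System.out.println(file + ": compressed with a splittable codec ("
                    + codec.getClass().getSimpleName() + ")");
        } else {
            System.out.println(file + ": compressed but not splittable - "
                    + "MapReduce falls back to a single mapper for the whole file");
        }
    }
}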
For example, suppose that the file in Figure 5-8 is a 1GB text file in your Hadoop cluster, and that your block size is set to the BigInsights default of 128MB, which means that your file spans eight blocks. When this file is compressed using the conventional algorithms available in Hadoop, it's no longer possible to parallelize the processing for each of the compressed file splits, because the file can be decompressed only as a whole, and not as individual parts based on the splits. Figure 5-9 depicts this file in a compressed (and binary) state, with the splits being impossible to decompress individually. Notice the mismatch? (The split boundaries are dotted lines, and the block boundaries are solid lines.)
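By contrast, writing the same data as a block-compressed SequenceFile keeps it both compressed and splittable, because the writer places sync markers between compressed blocks. The sketch below uses the standard org.apache.hadoop.io.SequenceFile API from the Hadoop 1.x line; the output path, key/value types, and record contents are simply illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressedSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/example.seq");   // hypothetical output path

        // BLOCK compression compresses batches of records together; the writer
        // inserts a sync marker before each compressed block, so MapReduce can
        // start a mapper at any block boundary without decompressing the whole file.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out,
                LongWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK,
                new DefaultCodec());
        try {
            for (long i = 0; i < 1000; i++) {
                writer.append(new LongWritable(i), new Text("record " + i));
            }
        } finally {
            writer.close();
        }
    }
}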
Because Hadoop 1.0.3 doesn't support splittable text compression natively, all the splits for a compressed text file are processed by only a single mapper. For many workloads, this would cause such a significant performance hit that compression wouldn't be a viable option. However, Jaql is configured to understand splittable compression for text files and will process them automatically with parallel mappers. You can do this manually for other environments (such as Pig
Figure 5-9 A compressed unsplittable file