In this case, MapReduce will do the right thing and not try to split the gzipped file, since it
knows that the input is gzip-compressed (by looking at the filename extension) and that
gzip does not support splitting. This will work, but at the expense of locality: a single map
will process the eight HDFS blocks, most of which will not be local to the map. Also,
with fewer maps, the job is less granular and so may take longer to run.
If the file in our hypothetical example were an LZO file, we would have the same problem,
because the underlying compression format does not provide a way for a reader to
synchronize itself with the stream. However, it is possible to preprocess LZO files using an
indexer tool that comes with the Hadoop LZO libraries, which you can obtain from the
Google and GitHub sites listed in Codecs. The tool builds an index of split points,
effectively making the files splittable when the appropriate MapReduce input format is used.
A bzip2 file, on the other hand, does provide a synchronization marker between blocks (a
48-bit approximation of pi), so it does support splitting. (Table 5-1 lists whether each
compression format supports splitting.)
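This splittability test is exactly the one Hadoop's text input formats perform: ask CompressionCodecFactory for the codec matching the filename extension, and treat the file as splittable only if there is no codec at all (an uncompressed file) or the codec implements the SplittableCompressionCodec interface, as BZip2Codec does and GzipCodec does not. A minimal standalone sketch (the SplitCheck class name is illustrative, not part of any Hadoop API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

// Hypothetical utility: reports whether MapReduce would split each file
// named on the command line, using the same test as TextInputFormat.
public class SplitCheck {
  public static void main(String[] args) {
    CompressionCodecFactory factory =
        new CompressionCodecFactory(new Configuration());
    for (String name : args) {
      CompressionCodec codec = factory.getCodec(new Path(name));
      // No codec means the file is uncompressed, hence splittable;
      // otherwise the codec itself must support splitting (bzip2 does).
      boolean splittable =
          codec == null || codec instanceof SplittableCompressionCodec;
      System.out.printf("%s\t%s\tsplittable=%b%n", name,
          codec == null ? "uncompressed" : codec.getClass().getSimpleName(),
          splittable);
    }
  }
}

Run over file.txt, file.gz, and file.bz2, this would report the gzipped file alone as not splittable.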
WHICH COMPRESSION FORMAT SHOULD I USE?
Hadoop applications process large datasets, so you should strive to take advantage of compression.
Which compression format you use depends on such considerations as file size, format, and the tools you
are using for processing. Here are some suggestions, arranged roughly in order of most to least effective:
▪ Use a container file format such as sequence files, Avro datafiles, ORCFiles, or Parquet
files, all of which support both compression and splitting. A fast compressor such as LZO,
LZ4, or Snappy is generally a good choice.
▪ Use a compression format that supports splitting, such as bzip2 (although bzip2 is fairly slow),
or one that can be indexed to support splitting, such as LZO.
▪ Split the file into chunks in the application, and compress each chunk separately using any
supported compression format (it doesn't matter whether it is splittable); a minimal sketch
follows this sidebar. In this case, you should choose the chunk size so that the compressed
chunks are approximately the size of an HDFS block.
▪ Store the files uncompressed.
For large files, you should not use a compression format that does not support splitting on the whole file,
because you lose locality and make MapReduce applications very inefficient.
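The chunking option above needs nothing beyond the standard Java gzip stream. The following sketch is illustrative (the class name, chunk size, and part-file naming are assumptions, not any Hadoop convention); note that CHUNK_SIZE here bounds the uncompressed bytes per chunk, so in practice you would scale it up by your expected compression ratio so that each compressed chunk lands near the HDFS block size.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical chunker: copies the input into gzipped pieces of at most
// CHUNK_SIZE uncompressed bytes each (file.part-0.gz, file.part-1.gz, ...).
public class ChunkCompressor {
  static final long CHUNK_SIZE = 128L * 1024 * 1024; // tune to block size

  public static void main(String[] args) throws IOException {
    try (InputStream in =
        new BufferedInputStream(new FileInputStream(args[0]))) {
      byte[] buf = new byte[64 * 1024];
      int part = 0;
      int n = in.read(buf);
      while (n != -1) { // start a new chunk while input remains
        try (OutputStream out = new GZIPOutputStream(
            new FileOutputStream(args[0] + ".part-" + part++ + ".gz"))) {
          long written = 0;
          while (n != -1 && written < CHUNK_SIZE) {
            out.write(buf, 0, n);
            written += n;
            n = in.read(buf);
          }
        }
      }
    }
  }
}

Because each chunk is an independent gzip file, each one can be processed by its own map task, restoring the granularity and locality that a single large gzipped file loses.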
Using Compression in MapReduce
As described in Inferring CompressionCodecs using CompressionCodecFactory, if your
input files are compressed, they will be decompressed automatically as they are read by
MapReduce, using the filename extension to determine which codec to use.
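Concretely, the factory maps the extension to a codec, and the record reader wraps the raw file stream in that codec's decompressing stream. A rough sketch of what happens under the hood when a reader opens a .gz input (the class name and path are placeholders):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Rough sketch of how MapReduce opens a compressed input file.
public class OpenCompressedInput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]); // e.g. a file ending in .gz
    CompressionCodec codec =
        new CompressionCodecFactory(conf).getCodec(path); // GzipCodec for ".gz"
    InputStream in = (codec == null)
        ? fs.open(path)                           // no recognized extension: read as-is
        : codec.createInputStream(fs.open(path)); // decompress on the fly
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine()); // first record, decompressed
    }
  }
}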