In this case, MapReduce will do the right thing and not try to split the gzipped file, since it
knows that the input is gzip-compressed (by looking at the filename extension) and that
gzip does not support splitting. This will work, but at the expense of locality: a single map
will process the eight HDFS blocks, most of which will not be local to the map. Also,
with fewer maps, the job is less granular and so may take longer to run.
If the file in our hypothetical example were an LZO file, we would have the same problem,
because the underlying compression format does not provide a way for a reader to
synchronize itself with the stream. However, it is possible to preprocess LZO files using an
indexer tool that comes with the Hadoop LZO libraries, which you can obtain from the
Google and GitHub sites listed in Codecs. The tool builds an index of split points,
effectively making the files splittable when the appropriate MapReduce input format is used.
A bzip2 file, on the other hand, does provide a synchronization marker between blocks (a
48-bit approximation of pi), so it does support splitting. (Table 5-1 lists whether each
compression format supports splitting.)
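This splittability test is exactly the one Hadoop's text input formats perform: ask CompressionCodecFactory for the codec matching the filename extension, and treat the file as splittable only if there is no codec at all (an uncompressed file) or the codec implements the SplittableCompressionCodec interface, as BZip2Codec does and GzipCodec does not. A minimal standalone sketch (the SplitCheck class name is illustrative, not part of any Hadoop API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

// Hypothetical utility: reports whether MapReduce would split each file
// named on the command line, using the same test as TextInputFormat.
public class SplitCheck {
  public static void main(String[] args) {
    CompressionCodecFactory factory =
        new CompressionCodecFactory(new Configuration());
    for (String name : args) {
      CompressionCodec codec = factory.getCodec(new Path(name));
      // No codec means the file is uncompressed, hence splittable;
      // otherwise the codec itself must support splitting (bzip2 does).
      boolean splittable =
          codec == null || codec instanceof SplittableCompressionCodec;
      System.out.printf("%s\t%s\tsplittable=%b%n", name,
          codec == null ? "uncompressed" : codec.getClass().getSimpleName(),
          splittable);
    }
  }
}

Run over file.txt, file.gz, and file.bz2, this would report the gzipped file alone as not splittable.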
WHICH COMPRESSION FORMAT SHOULD I USE?
Hadoop applications process large datasets, so you should strive to take advantage of compression.
Which compression format you use depends on such considerations as file size, format, and the tools you
are using for processing. Here are some suggestions, arranged roughly in order of most to least effective:
▪ Use a container file format such as sequence files, Avro datafiles, ORCFiles, or Parquet
files, all of which support both compression and splitting. A fast compressor such as LZO,
LZ4, or Snappy is generally a good choice.
▪ Use a compression format that supports splitting, such as bzip2 (although bzip2 is fairly slow),
or one that can be indexed to support splitting, such as LZO.
▪ Split the file into chunks in the application, and compress each chunk separately using any
supported compression format (it doesn't matter whether it is splittable); a minimal sketch
follows this sidebar. In this case, you should choose the chunk size so that the compressed
chunks are approximately the size of an HDFS block.
▪ Store the files uncompressed.
For large files, you should not use a compression format that does not support splitting on the whole file,
because you lose locality and make MapReduce applications very inefficient.
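The chunking option above needs nothing beyond the standard Java gzip stream. The following sketch is illustrative (the class name, chunk size, and part-file naming are assumptions, not any Hadoop convention); note that CHUNK_SIZE here bounds the uncompressed bytes per chunk, so in practice you would scale it up by your expected compression ratio so that each compressed chunk lands near the HDFS block size.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical chunker: copies the input into gzipped pieces of at most
// CHUNK_SIZE uncompressed bytes each (file.part-0.gz, file.part-1.gz, ...).
public class ChunkCompressor {
  static final long CHUNK_SIZE = 128L * 1024 * 1024; // tune to block size

  public static void main(String[] args) throws IOException {
    try (InputStream in =
        new BufferedInputStream(new FileInputStream(args[0]))) {
      byte[] buf = new byte[64 * 1024];
      int part = 0;
      int n = in.read(buf);
      while (n != -1) { // start a new chunk while input remains
        try (OutputStream out = new GZIPOutputStream(
            new FileOutputStream(args[0] + ".part-" + part++ + ".gz"))) {
          long written = 0;
          while (n != -1 && written < CHUNK_SIZE) {
            out.write(buf, 0, n);
            written += n;
            n = in.read(buf);
          }
        }
      }
    }
  }
}

Because each chunk is an independent gzip file, each one can be processed by its own map task, restoring the granularity and locality that a single large gzipped file loses.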
Using Compression in MapReduce
As described in Inferring CompressionCodecs using CompressionCodecFactory, if your
input files are compressed, they will be decompressed automatically as they are read by
MapReduce, using the filename extension to determine which codec to use.
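Concretely, the factory maps the extension to a codec, and the record reader wraps the raw file stream in that codec's decompressing stream. A rough sketch of what happens under the hood when a reader opens a .gz input (the class name and path are placeholders):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Rough sketch of how MapReduce opens a compressed input file.
public class OpenCompressedInput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]); // e.g. a file ending in .gz
    CompressionCodec codec =
        new CompressionCodecFactory(conf).getCodec(path); // GzipCodec for ".gz"
    InputStream in = (codec == null)
        ? fs.open(path)                           // no recognized extension: read as-is
        : codec.createInputStream(fs.open(path)); // decompress on the fly
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine()); // first record, decompressed
    }
  }
}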