Hadoop I/O - Hadoop: The Definitive Guide

sion speed, but it is still slower than the other formats. LZO, LZ4, and Snappy, on the other hand, all optimize for speed and are around an order of magnitude faster than gzip, but compress less effectively. Snappy and LZ4 are also significantly faster than LZO for decompression. [44]

The “Splittable” column in Table 5-1 indicates whether the compression format supports splitting (that is, whether you can seek to any point in the stream and start reading from some point further on). Splittable compression formats are especially suitable for MapReduce; see “Compression and Input Splits” for further discussion.
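To make the idea of splitting concrete, the following sketch tries to start reading a gzip stream from the middle, which is what a reader handed an arbitrary split would have to do. It uses only the JDK's java.util.zip classes rather than Hadoop's codecs, and the class and method names (GzipSeekDemo, canReadFrom, sampleData) are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative only: shows that a plain gzip stream cannot be read from an arbitrary offset.
final class GzipSeekDemo {

    // Compress a byte array into a single gzip stream held in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Attempt to decompress starting at the given offset into the compressed bytes.
    // Returns true only if the decoder reaches the end of the stream without an error.
    static boolean canReadFrom(byte[] compressed, int offset) {
        InputStream raw = new ByteArrayInputStream(compressed, offset, compressed.length - offset);
        try (GZIPInputStream in = new GZIPInputStream(raw)) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // Discard the output; we only care whether decoding succeeds.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Hypothetical sample input: ten thousand short text records.
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("record ").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] compressed = gzip(sampleData());
        System.out.println("from start:  " + canReadFrom(compressed, 0));
        System.out.println("from middle: " + canReadFrom(compressed, compressed.length / 2));
    }
}
```

Reading from offset 0 should succeed, while reading from the midpoint should fail because the decoder finds no gzip header there. A splittable format is one where a reader can recover from this situation by finding a point further on at which decoding can start.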

Codecs

A codec is the implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface. So, for example, GzipCodec encapsulates the compression and decompression algorithm for gzip. Table 5-2 lists the codecs that are available for Hadoop.

Table 5-2. Hadoop compression codecs

Compression format    Hadoop CompressionCodec
DEFLATE               org.apache.hadoop.io.compress.DefaultCodec
gzip                  org.apache.hadoop.io.compress.GzipCodec
bzip2                 org.apache.hadoop.io.compress.BZip2Codec
LZO                   com.hadoop.compression.lzo.LzopCodec
LZ4                   org.apache.hadoop.io.compress.Lz4Codec
Snappy                org.apache.hadoop.io.compress.SnappyCodec
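The idea of a codec as an object that wraps a stream in a compressing or decompressing stream can be sketched without Hadoop on the classpath. The SimpleCodec interface below is a hypothetical stand-in, not Hadoop's CompressionCodec; its method names merely echo the createOutputStream and createInputStream methods of the real interface, and GzipSimpleCodec delegates to the JDK's java.util.zip classes instead of Hadoop's GzipCodec.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Round-trips data through any codec without knowing which algorithm it uses.
final class CodecRoundTrip {

    static byte[] compress(SimpleCodec codec, byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = codec.createOutputStream(sink)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    static byte[] decompress(SimpleCodec codec, byte[] compressed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = codec.createInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                sink.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        SimpleCodec codec = new GzipSimpleCodec();
        byte[] original = "Text to compress".getBytes(StandardCharsets.UTF_8);
        byte[] restored = decompress(codec, compress(codec, original));
        System.out.println("round trip ok: " + Arrays.equals(original, restored));
    }
}

// Hypothetical codec abstraction (an assumption for illustration, NOT the Hadoop API).
interface SimpleCodec {
    OutputStream createOutputStream(OutputStream out) throws IOException;
    InputStream createInputStream(InputStream in) throws IOException;
    String getDefaultExtension();
}

// One implementation per algorithm; this one wraps the JDK's gzip streams.
final class GzipSimpleCodec implements SimpleCodec {
    @Override
    public OutputStream createOutputStream(OutputStream out) throws IOException {
        return new GZIPOutputStream(out);
    }

    @Override
    public InputStream createInputStream(InputStream in) throws IOException {
        return new GZIPInputStream(in);
    }

    @Override
    public String getDefaultExtension() {
        return ".gz";
    }
}
```

The point of the design is that calling code such as compress and decompress depends only on the interface, so swapping in another algorithm means supplying another implementation rather than changing the callers.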

The LZO libraries are GPL licensed and may not be included in Apache distributions, so the Hadoop codecs must be downloaded separately from Google (or GitHub, which includes bug fixes and more tools). The LzopCodec, which is compatible with the lzop tool, is essentially the LZO format with extra headers, and is the one you normally want. There is also an LzoCodec for the pure LZO format, which uses the .lzo_deflate filename extension (by analogy with DEFLATE, which is gzip without the headers).
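The relationship between gzip and DEFLATE mentioned above can be checked with the JDK alone. This sketch compresses data with GZIPOutputStream, skips the header and trailer, and inflates what remains as a raw DEFLATE stream. It assumes the fixed 10-byte header and 8-byte trailer that GZIPOutputStream writes (other gzip writers may add optional header fields), and the class and method names are hypothetical. It says nothing about the byte layout produced by Hadoop's own codecs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

// Illustrative only: treats a gzip stream as header + raw DEFLATE body + trailer.
final class GzipVersusDeflate {

    static final int GZIP_HEADER_BYTES = 10; // fixed header written by GZIPOutputStream
    static final int GZIP_TRAILER_BYTES = 8; // CRC-32 followed by the uncompressed size

    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Inflate only the bytes between the gzip header and trailer as raw DEFLATE.
    static byte[] inflateGzipBody(byte[] gz) {
        int bodyLength = gz.length - GZIP_HEADER_BYTES - GZIP_TRAILER_BYTES;
        InputStream body = new ByteArrayInputStream(gz, GZIP_HEADER_BYTES, bodyLength);
        Inflater rawInflater = new Inflater(true); // true = expect no wrapper around the data
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InflaterInputStream in = new InflaterInputStream(body, rawInflater)) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                sink.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            rawInflater.end(); // a caller-supplied Inflater is not released by close()
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "gzip is DEFLATE plus headers".getBytes(StandardCharsets.UTF_8);
        byte[] gz = gzip(original);
        byte[] restored = inflateGzipBody(gz);
        System.out.println("body inflates to original: " + Arrays.equals(original, restored));
    }
}
```

If the body between the header and trailer inflates back to the original bytes, the extra gzip material is limited to the framing around the DEFLATE data, which is the same kind of relationship the text draws between the lzop format and pure LZO.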
