sion speed, but it is still slower than the other formats. LZO, LZ4, and Snappy, on the other hand, all optimize for speed and are around an order of magnitude faster than gzip, but compress less effectively. Snappy and LZ4 are also significantly faster than LZO for decompression.[44]
The “Splittable” column in Table 5-1 indicates whether the compression format supports splitting (that is, whether you can seek to any point in the stream and start reading from some point further on). Splittable compression formats are especially suitable for MapReduce; see “Compression and Input Splits” for further discussion.
Codecs
A codec is the implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface. So, for example, GzipCodec encapsulates the compression and decompression algorithm for gzip. Table 5-2 lists the codecs that are available for Hadoop.
Table 5-2. Hadoop compression codecs
Compression format    Hadoop CompressionCodec
DEFLATE               org.apache.hadoop.io.compress.DefaultCodec
gzip                  org.apache.hadoop.io.compress.GzipCodec
bzip2                 org.apache.hadoop.io.compress.BZip2Codec
LZO                   com.hadoop.compression.lzo.LzopCodec
LZ4                   org.apache.hadoop.io.compress.Lz4Codec
Snappy                org.apache.hadoop.io.compress.SnappyCodec
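As a small illustration, the mapping in Table 5-2 can be held in a lookup table keyed by format name. This is only a sketch: the class CodecTable and its methods are hypothetical names, not part of Hadoop, and the keys are spelled exactly as in the table. The isOnClasspath check uses plain JDK reflection to test whether a codec class can be found, which depends entirely on which JARs are present at run time.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper (not a Hadoop class): holds the rows of Table 5-2.
final class CodecTable {
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put("DEFLATE", "org.apache.hadoop.io.compress.DefaultCodec");
        CODECS.put("gzip", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put("bzip2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put("LZO", "com.hadoop.compression.lzo.LzopCodec");
        CODECS.put("LZ4", "org.apache.hadoop.io.compress.Lz4Codec");
        CODECS.put("Snappy", "org.apache.hadoop.io.compress.SnappyCodec");
    }

    /** Returns the codec class name for a format, or empty if the format is not in the table. */
    public static Optional<String> codecClassName(String format) {
        return Optional.ofNullable(CODECS.get(format));
    }

    /** Reports whether the codec class for a format can be located on the current classpath. */
    public static boolean isOnClasspath(String format) {
        Optional<String> name = codecClassName(format);
        if (!name.isPresent()) {
            return false;
        }
        try {
            // Locate the class without running its static initializers.
            Class.forName(name.get(), false, CodecTable.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> row : CODECS.entrySet()) {
            String note = isOnClasspath(row.getKey()) ? "" : " (not on classpath)";
            System.out.println(row.getKey() + " -> " + row.getValue() + note);
        }
    }
}
```

Returning an Optional rather than null makes the "format not in the table" case explicit at the call site.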
The LZO libraries are GPL-licensed and may not be included in Apache distributions, so the Hadoop codecs must be downloaded separately from Google (or […]). The LzopCodec, which is compatible with the lzop tool, is essentially the LZO format with extra headers, and is the one you normally want. There is also an LzoCodec for the pure LZO format, which uses the .lzo_deflate filename extension (by analogy with DEFLATE, which is gzip without the headers).
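The remark that DEFLATE is gzip without the headers can be illustrated outside Hadoop with the JDK's java.util.zip classes. The following is a minimal sketch, not Hadoop's codec implementation: the class GzipFraming and its methods are hypothetical names, and it assumes the fixed 10-byte header that GZIPOutputStream writes (no optional header fields). It skips that header and decodes the rest with a raw DEFLATE inflater, stopping when the DEFLATE data ends and ignoring the gzip trailer that follows.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

// Hypothetical demo class: shows that a gzip stream wraps raw DEFLATE data.
final class GzipFraming {
    // Assumption: a basic gzip header with no optional fields is 10 bytes long.
    private static final int GZIP_HEADER_BYTES = 10;

    /** Compresses data into the gzip format using the JDK. */
    public static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Skips the gzip header and inflates what follows as raw DEFLATE. */
    public static byte[] inflateGzipBody(byte[] gz) {
        Inflater inflater = new Inflater(true); // true = raw DEFLATE, no wrapper expected
        inflater.setInput(gz, GZIP_HEADER_BYTES, gz.length - GZIP_HEADER_BYTES);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && !inflater.finished()) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("body is not valid DEFLATE data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "hadoop hadoop hadoop".getBytes(StandardCharsets.UTF_8);
        byte[] gz = gzip(original);
        // Print the text recovered from the DEFLATE body of the gzip stream.
        System.out.println(new String(inflateGzipBody(gz), StandardCharsets.UTF_8));
    }
}
```

A real gzip reader would parse the header flags and verify the checksum in the trailer; the fixed offset here is enough only because the input comes from GZIPOutputStream.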