sion speed, but it is still slower than the other formats. LZO, LZ4, and Snappy, on the
other hand, all optimize for speed and are around an order of magnitude faster than gzip,
but compress less effectively. Snappy and LZ4 are also significantly faster than LZO for
decompression. [44]
The “Splittable” column in Table 5-1 indicates whether the compression format supports
splitting (that is, whether you can seek to any point in the stream and start reading from
some point further on). Splittable compression formats are especially suitable for
MapReduce; see “Compression and Input Splits” for further discussion.
Codecs
A codec is the implementation of a compression-decompression algorithm. In Hadoop, a
codec is represented by an implementation of the CompressionCodec interface. So,
for example, GzipCodec encapsulates the compression and decompression algorithm
for gzip. Table 5-2 lists the codecs that are available for Hadoop.
Table 5-2. Hadoop compression codecs
Compression format    Hadoop CompressionCodec
DEFLATE               org.apache.hadoop.io.compress.DefaultCodec
gzip                  org.apache.hadoop.io.compress.GzipCodec
bzip2                 org.apache.hadoop.io.compress.BZip2Codec
LZO                   com.hadoop.compression.lzo.LzopCodec
LZ4                   org.apache.hadoop.io.compress.Lz4Codec
Snappy                org.apache.hadoop.io.compress.SnappyCodec
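The idea of a codec as one object that wraps both directions of an algorithm can be sketched without Hadoop on the classpath. The example below is an illustration only: SimpleCodec and SimpleGzipCodec are hypothetical names, not Hadoop classes, and the gzip work is done by java.util.zip. The three method names are chosen to resemble createOutputStream(), createInputStream(), and getDefaultExtension() on Hadoop's CompressionCodec, but the signatures here use plain java.io streams rather than Hadoop's stream types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CodecSketch {

    /** Hypothetical stand-in for the codec idea; NOT Hadoop's CompressionCodec. */
    interface SimpleCodec {
        OutputStream createOutputStream(OutputStream out) throws IOException;
        InputStream createInputStream(InputStream in) throws IOException;
        String getDefaultExtension();
    }

    /** Hypothetical gzip codec backed by java.util.zip. */
    static class SimpleGzipCodec implements SimpleCodec {
        @Override
        public OutputStream createOutputStream(OutputStream out) throws IOException {
            return new GZIPOutputStream(out);
        }

        @Override
        public InputStream createInputStream(InputStream in) throws IOException {
            return new GZIPInputStream(in);
        }

        @Override
        public String getDefaultExtension() {
            return ".gz";
        }
    }

    /** Compress a byte array through whichever codec is supplied. */
    static byte[] compress(SimpleCodec codec, byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream out = codec.createOutputStream(buf)) {
            out.write(data);
        }
        return buf.toByteArray();
    }

    /** Decompress a byte array through whichever codec is supplied. */
    static byte[] decompress(SimpleCodec codec, byte[] compressed) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = codec.createInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        SimpleCodec codec = new SimpleGzipCodec();
        byte[] original = "Text Text Text Text".getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(codec, original);
        byte[] restored = decompress(codec, packed);
        System.out.println("extension: " + codec.getDefaultExtension());
        System.out.println("round trip: " + new String(restored, StandardCharsets.UTF_8));
    }
}
```

Because compress() and decompress() depend only on the interface, swapping in another format means supplying a different implementation, which is the same role the classes in Table 5-2 play for Hadoop.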
The LZO libraries are GPL-licensed and may not be included in Apache distributions, so
the Hadoop codecs must be downloaded separately from Google (or GitHub, which includes
bug fixes and more tools). The LzopCodec, which is compatible with the lzop tool, is
essentially the LZO format with extra headers, and is the one you normally want. There is
also an LzoCodec for the pure LZO format, which uses the .lzo_deflate filename extension
(by analogy with DEFLATE, which is gzip without the headers).
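The relationship between gzip and DEFLATE mentioned above can be checked with the JDK alone. The sketch below is an illustration using java.util.zip, not the Hadoop codecs, and it says nothing about the exact bytes those codecs write. It assumes the fixed gzip framing written by GZIPOutputStream (a 10-byte header and an 8-byte trailer holding a CRC-32 and the input length); the class name GzipFraming is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

public class GzipFraming {

    // Assumed framing sizes for a gzip member with no optional header fields.
    static final int GZIP_HEADER_BYTES = 10;
    static final int GZIP_TRAILER_BYTES = 8;

    /** Compress to a raw DEFLATE stream (nowrap = true: no header, no checksum). */
    static byte[] rawDeflate(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        try (OutputStream out = new DeflaterOutputStream(buf, deflater)) {
            out.write(data);
        } finally {
            deflater.end(); // a caller-supplied Deflater is not released by close()
        }
        return buf.toByteArray();
    }

    /** Compress to a gzip stream: header, DEFLATE payload, trailer. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(buf)) {
            out.write(data);
        }
        return buf.toByteArray();
    }

    /** Strip the assumed gzip header and trailer, leaving the payload bytes. */
    static byte[] gzipPayload(byte[] gz) {
        return Arrays.copyOfRange(gz, GZIP_HEADER_BYTES, gz.length - GZIP_TRAILER_BYTES);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "to be or not to be, that is the question"
                .getBytes(StandardCharsets.UTF_8);
        byte[] raw = rawDeflate(data);
        byte[] gz = gzip(data);
        System.out.printf("raw DEFLATE: %d bytes, gzip: %d bytes%n", raw.length, gz.length);
        System.out.printf("gzip starts with: %02x %02x%n", gz[0] & 0xff, gz[1] & 0xff);
        System.out.println("payload matches raw DEFLATE: "
                + Arrays.equals(gzipPayload(gz), raw));
    }
}
```

The LzopCodec and LzoCodec pair follows the same pattern: one format carries the extra headers that a command-line tool expects, and the other is the bare compressed stream.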