Compression
File compression brings two major benefits: it reduces the space needed to store files, and it
speeds up data transfer across the network or to or from disk. When dealing with large
volumes of data, both of these savings can be significant, so it pays to carefully consider
how to use compression in Hadoop.
There are many different compression formats, tools, and algorithms, each with different
characteristics. Table 5-1 lists some of the more common ones that can be used with
Hadoop.
Table 5-1. A summary of compression formats

Compression format  Tool   Algorithm  Filename extension  Splittable?
DEFLATE [a]         N/A    DEFLATE    .deflate            No
gzip                gzip   DEFLATE    .gz                 No
bzip2               bzip2  bzip2      .bz2                Yes
LZO                 lzop   LZO        .lzo                No [b]
LZ4                 N/A    LZ4        .lz4                No
Snappy              N/A    Snappy     .snappy             No

[a] DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available
command-line tool for producing files in DEFLATE format, as gzip is normally used. (Note that the gzip file format is
DEFLATE with extra headers and a footer.) The .deflate filename extension is a Hadoop convention.
[b] However, LZO files are splittable if they have been indexed in a preprocessing step. See Compression and Input
Splits.
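The note that the gzip file format is DEFLATE with extra headers and a footer can be checked with a short Python sketch. It is only an illustration under stated assumptions: the sample data is made up, and the fixed 10-byte header only holds when no optional gzip header fields are present (the code checks the flag byte before relying on it).

```python
import gzip
import struct
import zlib

# Hypothetical sample data, chosen only for illustration.
data = b"hadoop compression example\n" * 100

gz = gzip.compress(data)

# A gzip member starts with the magic bytes 0x1f 0x8b; method 8 means DEFLATE.
assert gz[:2] == b"\x1f\x8b" and gz[2] == 8

# Assumption: the flag byte is 0 (no filename, comment, or extra fields),
# so the header is exactly 10 bytes. The footer is always 8 bytes.
assert gz[3] == 0
header, body, trailer = gz[:10], gz[10:-8], gz[-8:]

# What remains is a raw DEFLATE stream; a negative wbits value tells zlib
# to expect no zlib or gzip wrapper around it.
raw = zlib.decompress(body, wbits=-15)

# The footer stores the CRC-32 and the length of the uncompressed data,
# both as little-endian 32-bit integers.
crc, size = struct.unpack("<II", trailer)

print(raw == data, crc == zlib.crc32(data), size == len(data))
```

Stripping the header and footer by hand is not something to do in practice; it is shown here only to make the relationship between the two formats concrete.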
All compression algorithms exhibit a space/time trade-off: faster compression and
decompression speeds usually come at the expense of smaller space savings. The tools listed in
Table 5-1 typically give some control over this trade-off at compression time by offering
nine different options: -1 means optimize for speed, and -9 means optimize for space. For
example, the following command creates a compressed file file.gz using the fastest
compression method:
% gzip -1 file
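The same trade-off can be observed programmatically. The following Python sketch compresses one buffer at levels 1 and 9 using the standard library gzip module, whose compresslevel argument corresponds to the -1 through -9 options. The sample data and the compress_at helper are hypothetical and exist only for this illustration; actual sizes and timings depend on the input and the machine.

```python
import gzip
import time

# Hypothetical sample data: repetitive, log-like text that compresses well.
data = b"".join(
    b"record %d: status=OK value=%d\n" % (i, (i * i) % 97)
    for i in range(50000)
)

def compress_at(level):
    """Compress `data` at the given gzip level; return (bytes, seconds)."""
    start = time.perf_counter()
    out = gzip.compress(data, compresslevel=level)
    return out, time.perf_counter() - start

fast, fast_secs = compress_at(1)     # comparable to `gzip -1`
small, small_secs = compress_at(9)   # comparable to `gzip -9`

print(f"original: {len(data)} bytes")
print(f"level 1:  {len(fast)} bytes in {fast_secs:.4f} s")
print(f"level 9:  {len(small)} bytes in {small_secs:.4f} s")
```

On data like this, level 9 should produce the smaller output and take longer, but the size of the gap varies widely with the input.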
The different tools have very different compression characteristics. gzip is a
general-purpose compressor and sits in the middle of the space/time trade-off. bzip2 compresses more
effectively than gzip, but is slower. bzip2's decompression speed is faster than its
compression speed.