fixed-length compression blocks. In contrast, the GNU LZO algorithm uses
variable-length compression blocks, which adds complexity: an index file is
needed to tell the mapper where it can safely split a compressed file. (For
GNU LZO compression, this means that mappers need to perform index look-ups
during decompression and read operations. The index also carries
administrative overhead, because if you move the compressed file, you must
move the corresponding index file as well.)
Second, many companies, including IBM, have legal policies that prevent
them from purchasing or releasing software that includes GNU General Public
License (GPL) components. This means that the approach described in online
Hadoop forums requires additional administrative overhead and configuration
work. In addition, some businesses have policies that restrict the
deployment of GPL code. The IBM CMX compression is fully integrated with
BigInsights and is covered by the same enterprise-friendly license agreement
as the rest of BigInsights, which means that you can use it with less hassle
and none of the complications associated with the GPL alternative.
In a future release of Hadoop, the bzip2 algorithm will support splitting.
However, decompression speed for bzip2 is much slower than for IBM
CMX, so bzip2 is not a desirable compression algorithm for workloads
where performance is important.
Figure 5-10 shows the compressed text file from the earlier examples, but
in a splittable state, where individual splits can be decompressed by their
own mappers. Note that the split sizes are equal, indicating fixed-length
compression blocks.
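The difference between fixed-length and variable-length blocks can be illustrated with a minimal Python sketch. It does not reproduce the IBM CMX or GNU LZO file formats: the function names, the byte offsets, and the assumption that each fixed-length block occupies the same number of bytes in the compressed file are hypothetical, chosen only to show why one case needs an index and the other does not.

```python
from bisect import bisect_right


def fixed_block_splits(file_size, block_size, blocks_per_split):
    """Split boundaries when every block has the same length.

    No index is needed: each boundary is a multiple of the block size.
    """
    split_size = block_size * blocks_per_split
    return [(start, min(start + split_size, file_size))
            for start in range(0, file_size, split_size)]


def indexed_splits(file_size, block_offsets, target_split_size):
    """Split boundaries when blocks vary in length.

    block_offsets is a hypothetical index: the sorted start offset of
    every block, beginning with 0. A split may end only on one of them.
    """
    splits = []
    start = 0
    while start < file_size:
        want = start + target_split_size
        if want >= file_size:
            end = file_size
        else:
            # Last block boundary at or before the desired end.
            i = bisect_right(block_offsets, want)
            end = block_offsets[i - 1] if i > 0 else start
            if end <= start:
                # One block is larger than the target: take the whole block.
                j = bisect_right(block_offsets, start)
                end = block_offsets[j] if j < len(block_offsets) else file_size
        splits.append((start, end))
        start = end
    return splits
```

With fixed-length blocks the splits come out equal in size, apart from the last one, as in the figure. With variable-length blocks the split sizes depend on the index, and the computation cannot run at all if the index is missing.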
In the following table, you can see the four compression algorithms that
are available on the BigInsights platform (IBM CMX, bzip2, gzip, and
DEFLATE) and some of their characteristics.
Compression  File       Splittable                  Degree of    Decompression
Codec        Extension                              Compression  Speed
-----------  ---------  --------------------------  -----------  -------------
IBM CMX      .cmx       Yes                         Medium       Fastest
bzip2        .bz2       Yes, but not yet available  Highest      Slow
gzip         .gz        No                          High         Fast
DEFLATE      .deflate   No                          High         Fast
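As a rough illustration of how the table might be used, the sketch below maps a file extension to whether the file can currently be split across mappers. The `CODECS` dictionary and the `can_split` function are hypothetical names, not part of BigInsights or Hadoop. Two assumptions are built in: bzip2 is treated as not splittable because the table lists that support as not yet available, and any unlisted extension is treated as uncompressed text.

```python
import os

# Extension -> (codec name, splittable today), taken from the table above.
CODECS = {
    ".cmx": ("IBM CMX", True),
    ".bz2": ("bzip2", False),      # splittable in principle, not yet available
    ".gz": ("gzip", False),
    ".deflate": ("DEFLATE", False),
}


def can_split(filename):
    """Return True if the file can be divided among several mappers."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in CODECS:
        # Assumption: an unlisted extension means uncompressed text.
        return True
    return CODECS[ext][1]
```

For example, `can_split("weblogs.cmx")` returns `True`, while `can_split("weblogs.gz")` returns `False`, which means a single mapper would have to process the whole gzip file.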