and MapReduce programs) by using the TextInputFormat input format
instead of the Hadoop standard.
Compression and Decompression Speeds
The old saying “nothing in this world is free” is surely true when it comes to compression. There's no magic going on; in essence, you are simply consuming CPU cycles to save disk space. So let's start with this assumption: there can be a performance penalty for compressing data in your Hadoop cluster, because when data is written to the cluster, the compression algorithms (which are CPU-intensive) need CPU cycles and time to compress the data. Likewise, when reading data, any MapReduce workloads against compressed data can incur a performance penalty because of the CPU cycles and time required to decompress it. This creates a conundrum: you need to balance the storage savings against the additional performance overhead.
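To make the trade-off concrete, here is a minimal sketch of how compression is commonly switched on for a MapReduce job. The property names and the GzipCodec choice are standard Hadoop settings, but the specific values shown are assumptions you would tune for your own cluster and data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionJobSetup {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();

        // Compress the intermediate map output: spends CPU cycles to
        // reduce the data spilled to local disk and shuffled over the network.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output-example");

        // Compress the final job output to save HDFS space,
        // at the cost of CPU time during the write.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        return job;
    }
}

Compressing the intermediate map output is often the cheapest win, because that data is shuffled across the network and written to local disk but never needs to be stored in HDFS.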
We should note that if you've got an application that's I/O bound (typical for many warehouse-style applications), you're likely to see a performance gain, because I/O-bound systems typically have spare CPU cycles (visible as I/O wait time) that can be used to run the compression and decompression algorithms. For example, if you use those idle CPU cycles to do the compression and you get good compression ratios, you end up pushing more data through the same I/O pipe, and that means faster performance for applications that need to fetch a lot of data from disk.
A BigInsights Bonus: IBM CMX Compression
BigInsights includes IBM CMX compression (an IBM version of the LZO compression codec), which supports splittable compression, enabling individual compressed splits to be processed in parallel by your MapReduce jobs.
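As a rough sketch of how a splittable codec is wired into a job, the configuration below follows the standard Hadoop codec-registration pattern. Only the Hadoop property names are standard here; the CMX class name is a hypothetical placeholder, so consult the BigInsights documentation for the actual codec class shipped with your installation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplittableCompressionSetup {
    // Hypothetical placeholder: substitute the codec class name that ships
    // with your BigInsights installation.
    private static final String CMX_CODEC = "com.ibm.biginsights.compress.CmxCodec";

    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();

        // Register the codec so MapReduce can recognize the compressed
        // files by extension and split them for parallel processing.
        conf.set("io.compression.codecs",
                "org.apache.hadoop.io.compress.DefaultCodec," + CMX_CODEC);

        Job job = Job.getInstance(conf, "splittable-compression-example");

        // Write the job output with the splittable codec so downstream
        // jobs can process each compressed split in parallel.
        FileOutputFormat.setCompressOutput(job, true);
        job.getConfiguration().set(
                "mapreduce.output.fileoutputformat.compress.codec", CMX_CODEC);

        return job;
    }
}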
Some Hadoop online forums describe how to use the GNU version of LZO to enable splittable compression, so why did IBM create its own version rather than use the GNU LZO alternative? First, the IBM CMX compression codec does not create an index while compressing a file, because it uses