and MapReduce programs) by using the TextInputFormat input format
instead of the Hadoop standard.
Compression and Decompression Speeds
The old saying “nothing in this world is free” is surely true when it comes to compression. There's no magic going on; in essence, you are simply consuming CPU cycles to save disk space. So let's start with this assumption: there can be a performance penalty for compressing data in your Hadoop cluster, because when data is written to the cluster, the compression algorithms (which are CPU-intensive) need CPU cycles and time to compress the data. Likewise, when reading data, any MapReduce workloads against compressed data can incur a performance penalty because of the CPU cycles and time required to decompress it. This creates a conundrum: you need to balance the storage savings against the additional performance overhead.
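To make the trade-off concrete, here is a minimal sketch of how compression is commonly switched on for a MapReduce job. The property names and the GzipCodec choice are standard Hadoop settings, but the specific values shown are assumptions you would tune for your own cluster and data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionJobSetup {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();

        // Compress the intermediate map output: spends CPU cycles to
        // reduce the data spilled to local disk and shuffled over the network.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output-example");

        // Compress the final job output to save HDFS space,
        // at the cost of CPU time during the write.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        return job;
    }
}

Compressing the intermediate map output is often the cheapest win, because that data is shuffled across the network and written to local disk but never needs to be stored in HDFS.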
We should note that if you've got an application that's I/O bound (typical for many warehouse-style applications), you're likely to see a performance gain, because I/O-bound systems typically have spare CPU cycles (visible as I/O wait time) that can be used to run the compression and decompression algorithms. For example, if you use those idle CPU cycles to do the compression and you get good compression ratios, you end up pushing more data through the same I/O pipe, and that means faster performance for applications that need to fetch a lot of data from disk.
A BigInsights Bonus: IBM CMX Compression
BigInsights includes IBM CMX compression (an IBM version of the LZO compression codec), which supports splittable compression, enabling individual compressed splits to be processed in parallel by your MapReduce jobs.
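As a rough sketch of how a splittable codec is wired into a job, the configuration below follows the standard Hadoop codec-registration pattern. Only the Hadoop property names are standard here; the CMX class name is a hypothetical placeholder, so consult the BigInsights documentation for the actual codec class shipped with your installation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplittableCompressionSetup {
    // Hypothetical placeholder: substitute the codec class name that ships
    // with your BigInsights installation.
    private static final String CMX_CODEC = "com.ibm.biginsights.compress.CmxCodec";

    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();

        // Register the codec so MapReduce can recognize the compressed
        // files by extension and split them for parallel processing.
        conf.set("io.compression.codecs",
                "org.apache.hadoop.io.compress.DefaultCodec," + CMX_CODEC);

        Job job = Job.getInstance(conf, "splittable-compression-example");

        // Write the job output with the splittable codec so downstream
        // jobs can process each compressed split in parallel.
        FileOutputFormat.setCompressOutput(job, true);
        job.getConfiguration().set(
                "mapreduce.output.fileoutputformat.compress.codec", CMX_CODEC);

        return job;
    }
}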
Some Hadoop online forums describe how to use the GNU version of LZO to enable splittable compression, so why did IBM create its own version rather than use the GNU LZO alternative? First, the IBM CMX compression codec does not create an index while compressing a file, because it uses