Hadoop I/O - Hadoop: The Definitive Guide

  public static void main(String[] args) throws Exception {
    // Instantiate the codec named by the first command-line argument
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec)
        ReflectionUtils.newInstance(codecClass, conf);
    Compressor compressor = null;
    try {
      // Borrow a compressor from the pool instead of creating a new one
      compressor = CodecPool.getCompressor(codec);
      CompressionOutputStream out =
          codec.createOutputStream(System.out, compressor);
      // Copy standard input to the compressed stream, leaving both streams open
      IOUtils.copyBytes(System.in, out, 4096, false);
      // Flush the remaining compressed data without closing System.out
      out.finish();
    } finally {
      // Always hand the compressor back, even if the copy failed
      CodecPool.returnCompressor(compressor);
    }
  }

We retrieve a Compressor instance from the pool for a given CompressionCodec, which we use in the codec's overloaded createOutputStream() method. By using a finally block, we ensure that the compressor is returned to the pool even if there is an IOException while copying the bytes between the streams.
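The borrow-and-return pattern can be tried without a Hadoop installation. The following is a minimal sketch that uses only the JDK's java.util.zip classes; SimpleDeflaterPool and its methods are hypothetical names invented for illustration and are not Hadoop's CodecPool. It shows a compressor being borrowed, used, and returned in a finally block so that a second call can reuse it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Hypothetical stand-in for a compressor pool; this is NOT Hadoop's CodecPool.
class SimpleDeflaterPool {
    private final Deque<Deflater> idle = new ArrayDeque<>();
    private int created = 0;

    // Borrow an idle Deflater, or create one if the pool is empty.
    synchronized Deflater borrow() {
        Deflater d = idle.poll();
        if (d == null) {
            d = new Deflater();
            created++;
        }
        return d;
    }

    // Reset the Deflater so it can be reused, then put it back. Null is ignored.
    synchronized void giveBack(Deflater d) {
        if (d == null) {
            return;
        }
        d.reset();
        idle.push(d);
    }

    // How many Deflater objects this pool has had to allocate so far.
    synchronized int createdCount() {
        return created;
    }

    // Compress data with a pooled Deflater, returning it in a finally block.
    static byte[] compress(SimpleDeflaterPool pool, byte[] data) throws IOException {
        Deflater deflater = null;
        try {
            deflater = pool.borrow();
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            DeflaterOutputStream out = new DeflaterOutputStream(sink, deflater);
            out.write(data);
            out.finish(); // flush remaining compressed bytes
            return sink.toByteArray();
        } finally {
            pool.giveBack(deflater); // runs even if write() throws
        }
    }

    public static void main(String[] args) throws IOException {
        SimpleDeflaterPool pool = new SimpleDeflaterPool();
        byte[] input = "hello hello hello hello".getBytes(StandardCharsets.UTF_8);
        compress(pool, input);
        compress(pool, input); // reuses the Deflater returned by the first call
        System.out.println("deflaters created: " + pool.createdCount());
    }
}
```

Because the return happens in the finally block, the pool gets its compressor back whether or not the write succeeds, which is the property the example above relies on.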

Compression and Input Splits

When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 128 MB, the file will be stored as eight blocks, and a MapReduce job using this file as input will create eight input splits, each processed independently as input to a separate map task.
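The arithmetic behind the eight splits is a ceiling division of the file size by the block size. The sketch below assumes that each split corresponds to exactly one block; SplitCount and numSplits are hypothetical names, and the split calculation in a real MapReduce job depends on further settings that are not modeled here.

```java
// Hypothetical helper for illustration; assumes one input split per HDFS block.
class SplitCount {
    // Number of block-sized splits needed to cover the file: ceil(fileSize / blockSize).
    static long numSplits(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        long blockSize = 128L * 1024 * 1024;
        System.out.println(numSplits(oneGb, blockSize)); // 1 GB / 128 MB = 8
    }
}
```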

Imagine now that the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS will store the file as eight blocks. However, creating a split for each block won't work, because it is impossible to start reading at an arbitrary point in the gzip stream and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way that would allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the next block, thereby synchronizing itself with the stream. For this reason, gzip does not support splitting.
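The effect can be observed with the JDK's own gzip classes. The following sketch compresses some sample records in memory and then tries to open a reader at the start of the stream and at its midpoint; GzipOffsetDemo and its helper methods are hypothetical names, and the in-memory byte array stands in for a file spread across HDFS blocks. A reader started at the midpoint has no header to parse and no way to find the next DEFLATE block, so it is expected to fail.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; uses only java.util.zip, not Hadoop's codecs.
class GzipOffsetDemo {
    // Build some repetitive, line-oriented sample data.
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sb.append("record ").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compress the whole input as a single gzip stream.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            out.write(data);
        }
        return sink.toByteArray();
    }

    // True if a gzip reader positioned at 'offset' can decode to the end without error.
    static boolean readableFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // discard the decompressed bytes; only success or failure matters here
            }
            return true;
        } catch (IOException e) {
            return false; // no gzip header or corrupt DEFLATE data at this position
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] gz = gzip(sampleData());
        System.out.println("from start:    " + readableFrom(gz, 0));
        System.out.println("from midpoint: " + readableFrom(gz, gz.length / 2));
    }
}
```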
