  public static void main(String[] args) throws Exception {
    // Load the codec class named by the first command-line argument.
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec)
        ReflectionUtils.newInstance(codecClass, conf);
    Compressor compressor = null;
    try {
      // Borrow a compressor from the pool instead of creating a new one.
      compressor = CodecPool.getCompressor(codec);
      CompressionOutputStream out =
          codec.createOutputStream(System.out, compressor);
      // Copy standard input to the compressed stream without closing either stream.
      IOUtils.copyBytes(System.in, out, 4096, false);
      // Finish writing the compressed data but leave System.out open.
      out.finish();
    } finally {
      // Always hand the compressor back to the pool, even after an exception.
      CodecPool.returnCompressor(compressor);
    }
  }
}
We retrieve a Compressor instance from the pool for a given CompressionCodec,
which we use in the codec's overloaded createOutputStream() method. By using a
finally block, we ensure that the compressor is returned to the pool even if there is an
IOException while copying the bytes between the streams.
Compression and Input Splits
When considering how to compress data that will be processed by MapReduce, it is
important to understand whether the compression format supports splitting. Consider an
uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 128
MB, the file will be stored as eight blocks, and a MapReduce job using this file as input
will create eight input splits, each processed independently as input to a separate map
task.
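The block arithmetic in this example can be sketched in a few lines of plain Java. This is only an illustration: the class name SplitCount and the method blockCount are hypothetical, and the sketch assumes one input split per HDFS block, as in the scenario above. Actual input formats may compute splits differently.

```java
final class SplitCount {

    // Number of blocks needed to hold a file: ceiling of fileSize / blockSize.
    // Assumes one input split per block (a simplification for illustration).
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 1024L * mb;   // the 1 GB file from the text
        long blockSize = 128L * mb;   // the 128 MB HDFS block size from the text
        System.out.println(blockCount(fileSize, blockSize)); // prints 8
    }
}
```

A file that is not an exact multiple of the block size simply ends with a shorter final block, which is why the sketch rounds up.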
Imagine now that the file is a gzip-compressed file whose compressed size is 1 GB. As
before, HDFS will store the file as eight blocks. However, creating a split for each block
won't work, because it is impossible to start reading at an arbitrary point in the gzip
stream and therefore impossible for a map task to read its split independently of the
others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores
data as a series of compressed blocks. The problem is that the start of each block is not
distinguished in any way that would allow a reader positioned at an arbitrary point in the
stream to advance to the beginning of the next block, thereby synchronizing itself with the
stream. For this reason, gzip does not support splitting.
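The symptom can be observed outside Hadoop with the JDK's own java.util.zip classes. The following is a minimal sketch, not a model of how MapReduce reads splits: the class GzipMidStreamDemo and its methods are hypothetical names, and the data is generated in memory. It compresses some sample records into one gzip stream, then tries to decode that stream once from the beginning and once from its midpoint.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipMidStreamDemo {

    // Build some repetitive sample records (hypothetical data for illustration).
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("record ").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compress the bytes into a single in-memory gzip stream.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // True only if the bytes from 'offset' onward decode as a complete gzip stream.
    static boolean canDecodeFrom(byte[] compressed, int offset) {
        try (InputStream in = new GZIPInputStream(
                new ByteArrayInputStream(compressed, offset, compressed.length - offset))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // Drain the stream; the gzip trailer is verified at the end.
            }
            return true;
        } catch (IOException e) {
            // Covers a missing gzip header as well as corrupt DEFLATE data.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] compressed = gzip(sampleData());
        System.out.println("from start:    " + canDecodeFrom(compressed, 0));
        System.out.println("from midpoint: " + canDecodeFrom(compressed, compressed.length / 2));
    }
}
```

Decoding from the start succeeds, while the attempt from the midpoint is rejected because the reader finds neither a gzip header nor a recognizable block boundary there. A map task handed only the second half of a gzip file would be in the same position.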