output, which is compressed by the CompressionOutputStream. Finally, we call finish() on the CompressionOutputStream, which tells the compressor to finish writing to the compressed stream, but doesn't close the stream. We can try it out with the following command line, which compresses the string "Text" using the StreamCompressor program with the GzipCodec, then decompresses it from standard input using gunzip:
% echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec \
  | gunzip -
Text
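The finish-without-close behavior has a direct analogue in the standard library: java.util.zip.GZIPOutputStream also has a finish() method that writes the gzip trailer but leaves the underlying stream open. A minimal, self-contained sketch (plain JDK, no Hadoop; the class name FinishDemo is ours) that mirrors what StreamCompressor does:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class FinishDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        GZIPOutputStream gzOut = new GZIPOutputStream(sink);
        gzOut.write("Text\n".getBytes(StandardCharsets.UTF_8));
        // finish() flushes the compressor and writes the gzip trailer,
        // but leaves the underlying stream (sink) open for further use.
        gzOut.finish();

        // Round-trip: decompressing recovers the original bytes.
        GZIPInputStream gzIn =
            new GZIPInputStream(new ByteArrayInputStream(sink.toByteArray()));
        String result = new String(gzIn.readAllBytes(), StandardCharsets.UTF_8);
        System.out.print(result); // prints "Text" followed by a newline
    }
}
```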
Inferring CompressionCodecs using CompressionCodecFactory
If you are reading a compressed file, normally you can infer which codec to use by looking at its filename extension. A file ending in .gz can be read with GzipCodec, and so on. The extensions for each compression format are listed in Table 5-1.
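Conceptually, this lookup is just a mapping from filename extension to codec; a simplified sketch of the idea (CodecLookup and codecFor are hypothetical names, not Hadoop's actual implementation, which builds the mapping from the codecs registered in the configuration):

```java
import java.util.Map;

public class CodecLookup {
    // A subset of the extension-to-codec mapping, hardcoded here for
    // illustration; the real CompressionCodecFactory derives it from
    // the registered codecs' default extensions.
    private static final Map<String, String> CODEC_BY_EXTENSION = Map.of(
        ".gz", "org.apache.hadoop.io.compress.GzipCodec",
        ".bz2", "org.apache.hadoop.io.compress.BZip2Codec",
        ".deflate", "org.apache.hadoop.io.compress.DefaultCodec");

    // Returns the codec class name for a path, or null if no
    // registered extension matches.
    static String codecFor(String path) {
        for (Map.Entry<String, String> e : CODEC_BY_EXTENSION.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("logs/file.txt.gz"));
        // prints org.apache.hadoop.io.compress.GzipCodec
    }
}
```

Returning null when nothing matches mirrors getCodec(), which is why the program below checks for a null codec.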
CompressionCodecFactory provides a way of mapping a filename extension to a CompressionCodec using its getCodec() method, which takes a Path object for the file in question. Example 5-2 shows an application that uses this feature to decompress files.
Example 5-2. A program to decompress a compressed file using a codec inferred from the file's extension
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }

        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
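The removeSuffix() call derives the output filename by stripping the codec's default extension, so a file named file.txt.gz decompresses to file.txt. The string operation involved amounts to the following sketch (SuffixDemo and this removeSuffix are a simplified re-implementation for illustration, not Hadoop's own code):

```java
public class SuffixDemo {
    // Drop the trailing suffix from the filename if present; otherwise
    // return the filename unchanged.
    static String removeSuffix(String filename, String suffix) {
        if (filename.endsWith(suffix)) {
            return filename.substring(0, filename.length() - suffix.length());
        }
        return filename;
    }

    public static void main(String[] args) {
        System.out.println(removeSuffix("file.txt.gz", ".gz"));
        // prints file.txt
    }
}
```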