output, which is compressed by the CompressionOutputStream. Finally, we call finish() on the CompressionOutputStream, which tells the compressor to finish writing to the compressed stream, but doesn't close the stream. We can try it out with the following command line, which compresses the string "Text" using the StreamCompressor program with the GzipCodec, then decompresses it from standard input using gunzip:
% echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec \
  | gunzip -
Text
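The finish-without-close behavior has a direct analogue in the standard library: java.util.zip.GZIPOutputStream also has a finish() method that writes the gzip trailer but leaves the underlying stream open. A minimal, self-contained sketch (plain JDK, no Hadoop; the class name FinishDemo is ours) that mirrors what StreamCompressor does:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class FinishDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        GZIPOutputStream gzOut = new GZIPOutputStream(sink);
        gzOut.write("Text\n".getBytes(StandardCharsets.UTF_8));
        // finish() flushes the compressor and writes the gzip trailer,
        // but leaves the underlying stream (sink) open for further use.
        gzOut.finish();

        // Round-trip: decompressing recovers the original bytes.
        GZIPInputStream gzIn =
            new GZIPInputStream(new ByteArrayInputStream(sink.toByteArray()));
        String result = new String(gzIn.readAllBytes(), StandardCharsets.UTF_8);
        System.out.print(result); // prints "Text" followed by a newline
    }
}
```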
Inferring CompressionCodecs using CompressionCodecFactory
If you are reading a compressed file, normally you can infer which codec to use by looking at its filename extension. A file ending in .gz can be read with GzipCodec, and so on. The extensions for each compression format are listed in Table 5-1.
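Conceptually, this lookup is just a mapping from filename extension to codec; a simplified sketch of the idea (CodecLookup and codecFor are hypothetical names, not Hadoop's actual implementation, which builds the mapping from the codecs registered in the configuration):

```java
import java.util.Map;

public class CodecLookup {
    // A subset of the extension-to-codec mapping, hardcoded here for
    // illustration; the real CompressionCodecFactory derives it from
    // the registered codecs' default extensions.
    private static final Map<String, String> CODEC_BY_EXTENSION = Map.of(
        ".gz", "org.apache.hadoop.io.compress.GzipCodec",
        ".bz2", "org.apache.hadoop.io.compress.BZip2Codec",
        ".deflate", "org.apache.hadoop.io.compress.DefaultCodec");

    // Returns the codec class name for a path, or null if no
    // registered extension matches.
    static String codecFor(String path) {
        for (Map.Entry<String, String> e : CODEC_BY_EXTENSION.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("logs/file.txt.gz"));
        // prints org.apache.hadoop.io.compress.GzipCodec
    }
}
```

Returning null when nothing matches mirrors getCodec(), which is why the program below checks for a null codec.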
CompressionCodecFactory provides a way of mapping a filename extension to a CompressionCodec using its getCodec() method, which takes a Path object for the file in question. Example 5-2 shows an application that uses this feature to decompress files.
Example 5-2. A program to decompress a compressed file using a codec inferred from the file's extension
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }

        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
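The removeSuffix() call derives the output filename by stripping the codec's default extension, so a file named file.txt.gz decompresses to file.txt. The string operation involved amounts to the following sketch (SuffixDemo and this removeSuffix are a simplified re-implementation for illustration, not Hadoop's own code):

```java
public class SuffixDemo {
    // Drop the trailing suffix from the filename if present; otherwise
    // return the filename unchanged.
    static String removeSuffix(String filename, String suffix) {
        if (filename.endsWith(suffix)) {
            return filename.substring(0, filename.length() - suffix.length());
        }
        return filename;
    }

    public static void main(String[] args) {
        System.out.println(removeSuffix("file.txt.gz", ".gz"));
        // prints file.txt
    }
}
```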