    InputStream in = null;
    OutputStream out = null;
    try {
      in = codec.createInputStream(fs.open(inputPath));
      out = fs.create(new Path(outputUri));
      IOUtils.copyBytes(in, out, conf);
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}
Once the codec has been found, it is used to strip off the file suffix to form the output filename (via the removeSuffix() static method of CompressionCodecFactory). In this way, a file named file.gz is decompressed to file by invoking the program as follows:
% hadoop FileDecompressor file.gz
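The suffix stripping performed by removeSuffix() can be sketched in plain Java. This is a simplified stand-in for illustration, not the Hadoop implementation itself:

```java
// Simplified stand-in for CompressionCodecFactory.removeSuffix():
// strip a trailing suffix from a filename if it is present.
public class RemoveSuffixDemo {
    public static String removeSuffix(String filename, String suffix) {
        if (filename.endsWith(suffix)) {
            return filename.substring(0, filename.length() - suffix.length());
        }
        return filename;
    }

    public static void main(String[] args) {
        // file.gz -> file, matching the decompression example above
        System.out.println(removeSuffix("file.gz", ".gz")); // prints "file"
    }
}
```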
CompressionCodecFactory loads all the codecs in Table 5-2, except LZO, as well as any listed in the io.compression.codecs configuration property (Table 5-3). By default, the property is empty; you would need to alter it only if you have a custom codec that you wish to register (such as the externally hosted LZO codecs). Each codec knows its default filename extension, thus permitting CompressionCodecFactory to search through the registered codecs to find a match for the given extension (if any).
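The extension matching can be illustrated with a self-contained sketch. Here strings stand in for CompressionCodec instances; the real factory builds its lookup map from the registered codec classes rather than a hardcoded table:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch of the extension lookup performed by
// CompressionCodecFactory: each codec's default filename extension
// is checked against the end of the given filename.
public class CodecLookupDemo {
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".deflate", "DefaultCodec");
        CODECS.put(".gz", "GzipCodec");
        CODECS.put(".bz2", "BZip2Codec");
    }

    public static String getCodec(String filename) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (filename.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null; // no registered codec matches this extension
    }

    public static void main(String[] args) {
        System.out.println(getCodec("file.gz"));  // prints "GzipCodec"
        System.out.println(getCodec("file.txt")); // prints "null"
    }
}
```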
Table 5-3. Compression codec properties

Property name          Type                         Default value  Description
io.compression.codecs  Comma-separated Class names  (empty)        A list of additional CompressionCodec classes for compression/decompression
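To register a custom codec you would set io.compression.codecs in your configuration (typically core-site.xml). The class names below are those used by the externally hosted hadoop-lzo project and serve only as an example; substitute the classes from your own codec:

```xml
<property>
  <name>io.compression.codecs</name>
  <value>com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
```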
Native libraries
For performance, it is preferable to use a native library for compression and decompression. For example, in one test, using the native gzip libraries reduced decompression times by up to 50% and compression times by around 10% (compared to the built-in Java implementation).
Table 5-4 shows the availability of Java and native implementations for each compression format. All formats have native implementations, but not all have a Java implementation (LZO, for example).