With most Hadoop output formats, we can specify a compression codec that will compress the data. As we have already seen, Spark's native input formats (textFile and sequenceFile) can automatically handle some types of compression for us. When you're reading in compressed data, some compression codecs can be used to automatically detect the compression type.
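A minimal sketch of both directions, assuming a spark-shell session where sc is already defined and using hypothetical input/output paths:

```scala
import org.apache.hadoop.io.compress.GzipCodec

// On read, textFile recognizes the .gz extension and decompresses it
// transparently, so no codec has to be named.
val lines = sc.textFile("input/access-logs.gz")

// On write, passing a codec class compresses the output part files.
lines.saveAsTextFile("output/access-logs-gzipped", classOf[GzipCodec])
```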
These compression options apply only to the Hadoop formats that support compression, namely those that are written out to a filesystem. The Hadoop formats that write to databases generally do not implement support for compression; where records are compressed, that compression is configured in the database itself.
Choosing an output compression codec can have a big impact on future users of the data. With distributed systems such as Spark, we normally try to read our data from multiple machines in parallel. To make this possible, each worker needs to be able to find the start of a new record. Some compression formats make this impossible, forcing a single node to read in all of the data, which can easily become a bottleneck. Formats that can be read in parallel from multiple machines are called “splittable.” Table 5-3 lists the available compression options.
Table 5-3. Compression options

| Format | Splittable | Average compression speed | Effectiveness on text | Hadoop compression codec | Pure Java | Native | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gzip | N | Fast | High | org.apache.hadoop.io.compress.GzipCodec | Y | Y | |
| lzo | Y* | Very fast | Medium | com.hadoop.compression.lzo.LzoCodec | Y | Y | LZO requires installation on every worker node |
| bzip2 | Y | Slow | Very high | org.apache.hadoop.io.compress.BZip2Codec | Y | Y | Uses pure Java for splittable version |
| zlib | N | Slow | Medium | org.apache.hadoop.io.compress.DefaultCodec | Y | Y | Default compression codec for Hadoop |

* Depends on the library used.
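When the output will be consumed by further distributed jobs, a splittable codec such as bzip2 can be selected at write time. A minimal sketch, again assuming a spark-shell session with sc defined and hypothetical paths:

```scala
import org.apache.hadoop.io.compress.BZip2Codec

// bzip2 output is splittable (see Table 5-3), so a later job can hand
// different blocks of the same file to different workers instead of
// funneling the whole file through one reader.
val records = sc.textFile("input/events.txt")
records.saveAsTextFile("output/events-bz2", classOf[BZip2Codec])
```

gzip would compress faster here, but each gzipped output file could then only be consumed by a single reader.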