With most Hadoop output formats, we can specify a compression codec that will compress the data. As we have already seen, Spark's native input formats (textFile and sequenceFile) can automatically handle some types of compression for us. When you're reading in compressed data, some compression codecs can be used to automatically detect the compression type.
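A minimal sketch of both directions, assuming a spark-shell session where sc is already defined and using hypothetical input/output paths:

```scala
import org.apache.hadoop.io.compress.GzipCodec

// On read, textFile recognizes the .gz extension and decompresses it
// transparently, so no codec has to be named.
val lines = sc.textFile("input/access-logs.gz")

// On write, passing a codec class compresses the output part files.
lines.saveAsTextFile("output/access-logs-gzipped", classOf[GzipCodec])
```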
These compression options apply only to the Hadoop formats that support compression, namely those that are written out to a filesystem. The Hadoop formats that write to databases generally do not implement support for compression; where records are compressed, that compression is configured in the database itself.
Choosing an output compression codec can have a big impact on future users of the data. With distributed systems such as Spark, we normally try to read our data from multiple machines in parallel. To make this possible, each worker needs to be able to find the start of a new record. Some compression formats make this impossible, forcing a single node to read in all of the data, which can easily become a bottleneck. Formats that can be read in parallel from multiple machines are called “splittable.” Table 5-3 lists the available compression options.
Table 5-3. Compression options

| Format | Splittable | Average compression speed | Effectiveness on text | Hadoop compression codec | Pure Java | Native | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gzip | N | Fast | High | org.apache.hadoop.io.compress.GzipCodec | Y | Y | |
| lzo | Y* | Very fast | Medium | com.hadoop.compression.lzo.LzoCodec | Y | Y | LZO requires installation on every worker node |
| bzip2 | Y | Slow | Very high | org.apache.hadoop.io.compress.BZip2Codec | Y | Y | Uses pure Java for splittable version |
| zlib | N | Slow | Medium | org.apache.hadoop.io.compress.DefaultCodec | Y | Y | Default compression codec for Hadoop |

* Depends on the library used.
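When the output will be consumed by further distributed jobs, a splittable codec such as bzip2 can be selected at write time. A minimal sketch, again assuming a spark-shell session with sc defined and hypothetical paths:

```scala
import org.apache.hadoop.io.compress.BZip2Codec

// bzip2 output is splittable (see Table 5-3), so a later job can hand
// different blocks of the same file to different workers instead of
// funneling the whole file through one reader.
val records = sc.textFile("input/events.txt")
records.saveAsTextFile("output/events-bz2", classOf[BZip2Codec])
```

gzip would compress faster here, but each gzipped output file could then only be consumed by a single reader.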