Format: Snappy
Splittable: N
Average compression speed: Very Fast
Effectiveness on text: Low
Hadoop compression codec: org.apache.hadoop.io.compress.SnappyCodec
Pure Java: N
Native: Y
Comments: There is a pure Java port of Snappy, but it is not yet available in Spark/Hadoop.
While Spark's textFile() method can handle compressed input, it automatically disables splitting even when the input was compressed with a format that could be read in a splittable way. If you find yourself needing to read in a large single-file compressed input, consider skipping Spark's wrapper and instead use newAPIHadoopFile or hadoopFile and specify the correct compression codec.
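For example, a rough sketch of this approach (the input path here is hypothetical) looks like the following; the codec is resolved from the file extension using the codecs registered in the Hadoop configuration, and for a genuinely splittable read you would swap in the input format class that ships with your codec's library (such as an LZO text input format):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Sketch: read a compressed text file through the Hadoop input format API
// instead of textFile(). Extra codecs could be registered on this copy of the
// configuration (e.g., via the io.compression.codecs property) if needed.
val conf = new Configuration(sc.hadoopConfiguration)
val compressed = sc.newAPIHadoopFile(
  "file:///path/to/large-input.txt.gz",  // hypothetical single compressed file
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)
// Copy out of the reused Writable objects before doing anything else with them.
val lines = compressed.map { case (_, text) => text.toString }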
Some input formats (like SequenceFiles) allow us to compress only the values in key/
value data, which can be useful for doing lookups. Other input formats have their
own compression control: for example, many of the formats in Twitter's Elephant
Bird package work with LZO compressed data.
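As an illustrative sketch (the sample data and output path are made up), a SequenceFile can be written with record-level compression, which compresses each value while leaving the keys uncompressed:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.mapred.{JobConf, SequenceFileOutputFormat}

// Sketch: save key/value pairs as a SequenceFile with RECORD compression,
// so only the values are compressed and the keys stay readable.
val pairs = sc.parallelize(Seq(("panda", 3), ("happy", 1)))
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }

val jobConf = new JobConf(sc.hadoopConfiguration)
SequenceFileOutputFormat.setOutputCompressionType(jobConf, CompressionType.RECORD)

pairs.saveAsHadoopFile(
  "file:///tmp/pandas-seqfile",  // hypothetical output path
  classOf[Text],
  classOf[IntWritable],
  classOf[SequenceFileOutputFormat[Text, IntWritable]],
  jobConf)

Because the keys stay uncompressed, a reader can scan for a key without decompressing every value, which is what makes this layout handy for lookups.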
Filesystems
Spark supports a large number of filesystems for reading and writing to, which we
can use with any of the file formats we want.
Local/“Regular” FS
While Spark supports loading files from the local filesystem, it requires that the files
are available at the same path on all nodes in your cluster.
Some network filesystems, like NFS, AFS, and MapR's NFS layer, are exposed to the
user as a regular filesystem. If your data is already in one of these systems, then you
can use it as an input by just specifying a file:// path; Spark will handle it as long as
the filesystem is mounted at the same path on each node (see Example 5-29).
Example 5-29. Loading a compressed text file from the local filesystem in Scala
val rdd = sc.textFile("file:///home/holden/happypandas.gz")