Format: Snappy
Splittable: N
Average compression speed: Very Fast
Effectiveness on text: Low
Hadoop compression codec: org.apache.hadoop.io.compress.SnappyCodec
Pure Java: N
Native: Y
Comments: There is a pure Java port of Snappy, but it is not yet available in Spark/Hadoop.
While Spark's textFile() method can handle compressed input, it automatically disables splitting even when the input was compressed with a format that could be read in a splittable way. If you find yourself needing to read in a large single-file compressed input, consider skipping Spark's wrapper and instead use newAPIHadoopFile or hadoopFile and specify the correct compression codec.
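For example, a rough sketch of this approach (the input path here is hypothetical) looks like the following; the codec is resolved from the file extension using the codecs registered in the Hadoop configuration, and for a genuinely splittable read you would swap in the input format class that ships with your codec's library (such as an LZO text input format):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Sketch: read a compressed text file through the Hadoop input format API
// instead of textFile(). Extra codecs could be registered on this copy of the
// configuration (e.g., via the io.compression.codecs property) if needed.
val conf = new Configuration(sc.hadoopConfiguration)
val compressed = sc.newAPIHadoopFile(
  "file:///path/to/large-input.txt.gz",  // hypothetical single compressed file
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)
// Copy out of the reused Writable objects before doing anything else with them.
val lines = compressed.map { case (_, text) => text.toString }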
Some input formats (like SequenceFiles) allow us to compress only the values in key/
value data, which can be useful for doing lookups. Other input formats have their
own compression control: for example, many of the formats in Twitter's Elephant
Bird package work with LZO compressed data.
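As an illustrative sketch (the sample data and output path are made up), a SequenceFile can be written with record-level compression, which compresses each value while leaving the keys uncompressed:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.mapred.{JobConf, SequenceFileOutputFormat}

// Sketch: save key/value pairs as a SequenceFile with RECORD compression,
// so only the values are compressed and the keys stay readable.
val pairs = sc.parallelize(Seq(("panda", 3), ("happy", 1)))
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }

val jobConf = new JobConf(sc.hadoopConfiguration)
SequenceFileOutputFormat.setOutputCompressionType(jobConf, CompressionType.RECORD)

pairs.saveAsHadoopFile(
  "file:///tmp/pandas-seqfile",  // hypothetical output path
  classOf[Text],
  classOf[IntWritable],
  classOf[SequenceFileOutputFormat[Text, IntWritable]],
  jobConf)

Because the keys stay uncompressed, a reader can scan for a key without decompressing every value, which is what makes this layout handy for lookups.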
Filesystems
Spark supports a large number of filesystems for reading and writing to, which we
can use with any of the file formats we want.
Local/“Regular” FS
While Spark supports loading files from the local filesystem, it requires that the files
are available at the same path on all nodes in your cluster.
Some network filesystems, like NFS, AFS, and MapR's NFS layer, are exposed to the
user as a regular filesystem. If your data is already in one of these systems, then you
can use it as an input by just specifying a file:// path; Spark will handle it as long as
the filesystem is mounted at the same path on each node (see Example 5-29).
Example 5-29. Loading a compressed text file from the local filesystem in Scala
val rdd = sc.textFile("file:///home/holden/happypandas.gz")