SequenceFiles, and protocol buffers. We will show how to use several common
formats, as well as how to point Spark to different filesystems and configure
compression.
Structured data sources through Spark SQL
The Spark SQL module, covered in Chapter 9, provides a nicer and often more
efficient API for structured data sources, including JSON and Apache Hive. We
will briefly sketch how to use Spark SQL, but leave the bulk of the details to
Chapter 9.
Databases and key/value stores
We will sketch built-in and third-party libraries for connecting to Cassandra,
HBase, Elasticsearch, and JDBC databases.
We chose most of the methods here to be available in all of Spark's languages, but
some libraries are still Java and Scala only. We will point out when that is the case.
File Formats
Spark makes it very simple to load and save data in a large number of file formats.
Formats range from unstructured, like text, to semistructured, like JSON, to
structured, like SequenceFiles (see Table 5-1). The input formats that Spark wraps all
transparently handle compressed formats based on the file extension.
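The idea of choosing a decompressor from the file extension can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's or Hadoop's actual implementation: the names `open_by_extension`, `read_records`, and `demo` are hypothetical, the file names are made up, and only gzip and bzip2 from the standard library are covered.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical mapping from file extension to an opener (illustration only).
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_by_extension(path):
    """Open a text file, picking a decompressor from the file extension."""
    ext = os.path.splitext(path)[1]
    opener = _OPENERS.get(ext, open)  # unknown extensions fall back to open()
    return opener(path, "rt", encoding="utf-8")


def read_records(path):
    """Return the file's records, assuming one record per line."""
    with open_by_extension(path) as f:
        return [line.rstrip("\n") for line in f]


def demo():
    """Write the same two records plain and gzipped, then read both back."""
    with tempfile.TemporaryDirectory() as d:
        plain = os.path.join(d, "part-00000.txt")
        zipped = os.path.join(d, "part-00000.txt.gz")
        with open(plain, "w", encoding="utf-8") as f:
            f.write("a\nb\n")
        with gzip.open(zipped, "wt", encoding="utf-8") as f:
            f.write("a\nb\n")
        return read_records(plain), read_records(zipped)
```

The caller of `read_records` never mentions compression; the extension alone decides how the bytes are decoded, which is the behavior the paragraph above describes for Spark's wrapped input formats.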
Table 5-1. Common supported file formats

Format name      | Structured | Comments
Text files       | No         | Plain old text files. Records are assumed to be one per line.
JSON             | Semi       | Common text-based format, semistructured; most libraries require one record per line.
CSV              | Yes        | Very common text-based format, often used with spreadsheet applications.
SequenceFiles    | Yes        | A common Hadoop file format used for key/value data.
Protocol buffers | Yes        | A fast, space-efficient multilanguage format.
Object files     | Yes        | Useful for saving data from a Spark job to be consumed by shared code. Breaks if you change your classes, as it relies on Java Serialization.
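To make the "one record per line" convention for JSON concrete, here is a small standard-library sketch. It is an illustration rather than Spark's JSON loader: the function name `parse_json_lines` and the sample records are invented for this example.

```python
import json


def parse_json_lines(lines):
    """Parse an iterable of strings holding one JSON record per line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        records.append(json.loads(line))
    return records


# Made-up sample input: each non-blank line is a complete JSON object.
sample = ['{"id": 1, "tag": "a"}', "", '{"id": 2}']
parsed = parse_json_lines(sample)
```

Because every line is a complete record, the input can be split on line boundaries and each piece parsed independently. A JSON document that spans several lines would not work with this per-line approach.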
In addition to the output mechanisms supported directly in Spark, we can use both
Hadoop's new and old file APIs for keyed (or paired) data. We can use these only
with key/value data, because the Hadoop interfaces require key/value data, even
though some formats ignore the key. In cases where the format ignores the key, it is
common to use a dummy key (such as null).
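The dummy-key pattern can be sketched in plain Python, with `None` standing in for null. Both helpers below are hypothetical and do not call any Hadoop or Spark API; `write_values_only` merely imitates an output format that ignores the key.

```python
def with_dummy_key(values):
    """Pair each value with a None key, for formats that ignore the key."""
    return [(None, value) for value in values]


def write_values_only(pairs):
    """Imitate a key-ignoring output format: emit one line per value."""
    return "\n".join(str(value) for _key, value in pairs)


pairs = with_dummy_key(["first", "second"])
output = write_values_only(pairs)
```

The data is reshaped into key/value pairs only to satisfy the interface; the key carries no information and is dropped when the values are written.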
 