Loading and Saving Your Data - Learning Spark


SequenceFiles, and protocol buffers. We will show how to use several common formats, as well as how to point Spark to different filesystems and configure compression.

Structured data sources through Spark SQL

The Spark SQL module, covered in Chapter 9, provides a nicer and often more efficient API for structured data sources, including JSON and Apache Hive. We will briefly sketch how to use Spark SQL, but leave the bulk of the details to Chapter 9.

Databases and key/value stores

We will sketch built-in and third-party libraries for connecting to Cassandra, HBase, Elasticsearch, and JDBC databases.

We chose most of the methods here to be available in all of Spark's languages, but some libraries are still Java and Scala only. We will point out when that is the case.

File Formats

Spark makes it very simple to load and save data in a large number of file formats. Formats range from unstructured, like text, to semistructured, like JSON, to structured, like SequenceFiles (see Table 5-1). All of the input formats that Spark wraps transparently handle compressed files, selecting the decompression method from the file extension.
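To make the idea of extension-based decompression concrete, here is a minimal sketch in plain Python. It is not Spark's implementation: the `OPENERS` table and the `read_lines` and `demo` helpers are hypothetical names invented for this illustration, and only the standard library is used.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical mapping for this sketch only; it is an assumption made for
# illustration and does not describe how Spark or Hadoop pick a codec.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def read_lines(path):
    """Return the lines of a text file, decompressing by file extension."""
    extension = os.path.splitext(path)[1]
    opener = OPENERS.get(extension, open)  # unknown extension: plain text
    with opener(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def demo():
    """Write the same records plain and gzipped, then read both back."""
    records = ["alpha", "beta", "gamma"]
    payload = "\n".join(records) + "\n"
    with tempfile.TemporaryDirectory() as workdir:
        plain_path = os.path.join(workdir, "records.txt")
        gzip_path = os.path.join(workdir, "records.txt.gz")
        with open(plain_path, "w", encoding="utf-8") as handle:
            handle.write(payload)
        with gzip.open(gzip_path, "wt", encoding="utf-8") as handle:
            handle.write(payload)
        return read_lines(plain_path), read_lines(gzip_path)
```

The caller of `read_lines` never names a codec; the file name alone decides how the bytes are decoded, which is the behavior the paragraph above describes.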

Table 5-1. Common supported file formats

| Format name | Structured | Comments |
|---|---|---|
| Text files | No | Plain old text files. Records are assumed to be one per line. |
| JSON | Semi | Common text-based format, semistructured; most libraries require one record per line. |
| CSV | Yes | Very common text-based format, often used with spreadsheet applications. |
| SequenceFiles | Yes | A common Hadoop file format used for key/value data. |
| Protocol buffers | Yes | A fast, space-efficient multilanguage format. |
| Object files | Yes | Useful for saving data from a Spark job to be consumed by shared code. Breaks if you change your classes, as it relies on Java Serialization. |
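The "one record per line" convention noted for JSON in Table 5-1 can be sketched in plain Python as follows. This is only an illustration of the file layout, not Spark code; the helper names `parse_json_lines` and `to_json_lines` and the sample records are assumptions made up for the example.

```python
import json


def parse_json_lines(text):
    """Parse text holding one JSON record per line, skipping blank lines."""
    records = []
    for line in text.splitlines():
        if line.strip():
            records.append(json.loads(line))
    return records


def to_json_lines(records):
    """Serialize records back to one compact JSON document per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


# Hypothetical sample data: two records, each complete on its own line.
SAMPLE = '{"id": 1, "format": "text"}\n{"id": 2, "format": "json"}\n'
```

Because every line is a complete record, a file in this layout can be split at line boundaries and each piece parsed independently, which is why line-oriented loaders tend to expect it.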

In addition to the output mechanisms supported directly in Spark, we can use both Hadoop's new and old file APIs for keyed (or paired) data. We can use these only with key/value data, because the Hadoop interfaces require key/value data, even though some formats ignore the key. In cases where the format ignores the key, it is common to use a dummy key (such as null).
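The dummy-key convention amounts to wrapping each value in a pair whose key carries no information. A minimal plain-Python sketch of that data shape follows; the helpers `with_dummy_key` and `drop_keys` are hypothetical names, and `None` is used here as Python's counterpart of null. It shows only the shape of the records, not a call into any Hadoop or Spark API.

```python
def with_dummy_key(values):
    """Pair every value with a None key, for formats that ignore the key."""
    return [(None, value) for value in values]


def drop_keys(pairs):
    """Recover the values from key/value pairs, discarding the keys."""
    return [value for _key, value in pairs]
```

Reading such data back is the mirror image: the keys are discarded and only the values are kept.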
