CHAPTER 5
Loading and Saving Your Data
Both engineers and data scientists will find parts of this chapter useful. Engineers
may wish to explore more output formats to see if there is something well suited to
their intended downstream consumer. Data scientists can likely focus on the format
that their data is already in.
Motivation
We've looked at a number of operations we can perform on our data once we have it
distributed in Spark. So far our examples have loaded and saved all of their data from
a native collection and regular files, but odds are that your data doesn't fit on a single
machine, so it's time to explore our options for loading and saving.
Spark supports a wide range of input and output sources, partly because it builds on
the ecosystem available for Hadoop. In particular, Spark can access data through the
InputFormat and OutputFormat interfaces used by Hadoop MapReduce, which are
available for many common file formats and storage systems (e.g., S3, HDFS,
Cassandra, and HBase).1 The section “Hadoop Input and Output Formats” on page 84
shows how to use these formats directly.
More commonly, though, you will want to use higher-level APIs built on top of these
raw interfaces. Luckily, Spark and its ecosystem provide many options here. In this
chapter, we will cover three common sets of data sources:
File formats and filesystems
For data stored in a local or distributed filesystem, such as NFS, HDFS, or
Amazon S3, Spark can access a variety of file formats including text, JSON,
1 InputFormat and OutputFormat are Java APIs used to connect a data source with MapReduce.
 