Chapter 19. Spark
Apache Spark is a cluster computing framework for large-scale data processing. Unlike
most of the other processing frameworks discussed in this book, Spark does not use
MapReduce as an execution engine; instead, it uses its own distributed runtime for
executing work on a cluster. However, Spark has many parallels with MapReduce, in terms
of both API and runtime, as we will see in this chapter. Spark is closely integrated with
Hadoop: it can run on YARN and works with Hadoop file formats and storage backends like
HDFS.
Spark is best known for its ability to keep large working datasets in memory between jobs.
This capability allows Spark to outperform the equivalent MapReduce workflow (by an order
of magnitude or more in some cases [128]), where datasets are always loaded from disk.
Two styles of application that benefit greatly from Spark's processing model are iterative
algorithms (where a function is applied to a dataset repeatedly until an exit condition is
met) and interactive analysis (where a user issues a series of ad hoc exploratory queries on
a dataset).
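The benefit for iterative algorithms can be sketched without Spark at all. The following is a plain-Python analogy, not Spark's API: DiskDataset, run_until_converged, reloading, and cached are hypothetical names introduced only for this illustration. It runs the same iterative computation twice, once rereading the input from disk on every iteration (as a chain of separate MapReduce jobs would) and once reading it a single time and reusing the in-memory copy.

```python
import os
import tempfile


class DiskDataset:
    """Hypothetical stand-in for a dataset on disk; counts how often it is read."""

    def __init__(self, values):
        fd, self.path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(str(v) for v in values))
        self.loads = 0

    def load(self):
        self.loads += 1
        with open(self.path) as f:
            return [float(line) for line in f]

    def delete(self):
        os.remove(self.path)


def run_until_converged(get_data, tol=1e-6):
    """Move an estimate halfway toward the dataset mean until it is within tol."""
    estimate, iterations = 0.0, 0
    while True:
        data = get_data()  # one "job" per iteration
        mean = sum(data) / len(data)
        iterations += 1
        if abs(mean - estimate) < tol:  # exit condition
            return estimate, iterations
        estimate += 0.5 * (mean - estimate)


def reloading(dataset):
    # Every iteration rereads the input from disk.
    return run_until_converged(dataset.load)


def cached(dataset):
    # Read once, then reuse the in-memory copy on every iteration.
    data = dataset.load()
    return run_until_converged(lambda: data)
```

Both variants return the same estimate after the same number of iterations; the only difference is the loads counter, which grows with the iteration count in the first case and stays at one in the second. On a real cluster the repeated reads involve disk and network I/O, which is where the speedup described above would come from.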
Even if you don't need in-memory caching, Spark is very attractive for a couple of other
reasons: its DAG engine and its user experience. Unlike MapReduce, Spark's DAG engine
can process arbitrary pipelines of operators and translate them into a single job for the user.
Spark's user experience is also second to none, with a rich set of APIs for performing many
common data processing tasks, such as joins. At the time of writing, Spark provides APIs
in three languages: Scala, Java, and Python. We'll use the Scala API for most of the
examples in this chapter, but they should be easy to translate to the other languages. Spark
also comes with a REPL (read-eval-print loop) for both Scala and Python, which
makes it quick and easy to explore datasets.
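To give a feel for what it means to turn a pipeline of operators into a single job, here is a minimal plain-Python sketch. It is an analogy only: the Pipeline class is hypothetical, it handles a linear chain rather than a general DAG, and it says nothing about how Spark distributes work. Operators are merely recorded as they are chained, and nothing executes until collect() is called, at which point all of them run in one pass over the input.

```python
class Pipeline:
    """Hypothetical lazy pipeline: records operators, runs nothing until collect()."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def map(self, fn):
        # Return a new pipeline with the operator appended; no data is touched.
        return Pipeline(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        return Pipeline(self.source, self.ops + (("filter", pred),))

    def collect(self):
        # Chain all recorded operators lazily, then pull the data through once.
        stream = iter(self.source)
        for kind, fn in self.ops:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return list(stream)


# Made-up sample records of the form "<year> <temperature>".
lines = ["1950 22", "1950 -11", "1949 111", "malformed"]
pipeline = (
    Pipeline(lines)
    .map(str.split)
    .filter(lambda parts: len(parts) == 2)
    .map(lambda parts: (parts[0], int(parts[1])))
    .filter(lambda rec: rec[1] > 0)
)
result = pipeline.collect()  # [("1950", 22), ("1949", 111)]
```

The user writes four separate operators but triggers only one execution, with no intermediate result materialized between steps. Expressing the same chain as hand-written MapReduce jobs would typically mean more boilerplate and, for longer pipelines, intermediate output written between jobs.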
Spark is proving to be a good platform on which to build analytics tools, too, and to this
end the Apache Spark project includes modules for machine learning (MLlib), graph
processing (GraphX), stream processing (Spark Streaming), and SQL (Spark SQL). These
modules are not covered in this chapter; the interested reader should refer to the Apache
Spark website.