Chapter 19. Spark
Apache Spark is a cluster computing framework for large-scale data processing. Unlike
most of the other processing frameworks discussed in this book, Spark does not use
MapReduce as an execution engine; instead, it uses its own distributed runtime for
executing work on a cluster. However, Spark has many parallels with MapReduce, in terms of
both API and runtime, as we will see in this chapter. Spark is closely integrated with
Hadoop: it can run on YARN and works with Hadoop file formats and storage backends.
Spark is best known for its ability to keep large working datasets in memory between jobs.
This capability allows Spark to outperform the equivalent MapReduce workflow (by an
order of magnitude or more in some cases [128]), where datasets are always loaded from disk.
Two styles of application that benefit greatly from Spark's processing model are iterative
algorithms (where a function is applied to a dataset repeatedly until an exit condition is
met) and interactive analysis (where a user issues a series of ad hoc exploratory queries on
a dataset).
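The iterative case can be sketched as follows. This is a minimal illustration, assuming a local Spark installation; the object name IterativeCacheSketch, the helper approxMedian, and the sample data are hypothetical and not part of Spark. The dataset is marked with cache() once, and each pass of the loop launches a new job that reads the in-memory copy rather than recomputing it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example: names and data are illustrative only.
object IterativeCacheSketch {

  // Bisection search for a value below which roughly half the dataset lies.
  // Each loop iteration runs a separate Spark job (filter + count) over the
  // same RDD until the exit condition (interval narrower than tolerance) is met.
  def approxMedian(values: RDD[Double], lo0: Double, hi0: Double, tolerance: Double): Double = {
    val half = values.count() / 2
    var lo = lo0
    var hi = hi0
    while (hi - lo > tolerance) {
      val mid = (lo + hi) / 2
      if (values.filter(_ < mid).count() < half) lo = mid else hi = mid
    }
    (lo + hi) / 2
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("IterativeCacheSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // cache() asks Spark to keep the dataset in memory once it has been computed,
      // so later jobs reuse it instead of rebuilding it from the source.
      val values = sc.parallelize(1 to 1000).map(_.toDouble).cache()
      println(approxMedian(values, 0.0, 1000.0, 0.5))
    } finally {
      sc.stop()
    }
  }
}
```

Without the cache() call the program would still produce the same answer; the difference is that every iteration would recompute the dataset from its source.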
Even if you don't need in-memory caching, Spark is very attractive for a couple of other
reasons: its DAG engine and its user experience. Unlike MapReduce, Spark's DAG engine
can process arbitrary pipelines of operators and translate them into a single job for the user.
Spark's user experience is also second to none, with a rich set of APIs for performing many
common data processing tasks, such as joins. At the time of writing, Spark provides APIs
in three languages: Scala, Java, and Python. We'll use the Scala API for most of the
examples in this chapter, but they should be easy to translate to the other languages. Spark
also comes with a REPL (read-eval-print loop) for both Scala and Python, which
makes it quick and easy to explore datasets.
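As a rough illustration of a multi-operator pipeline, the sketch below chains a filter, a join, a map, and an aggregation, and only the final collect() call submits work to the cluster, as one job. It assumes Spark 1.3 or later (where pair-RDD operations such as join need no extra import); the object name PipelineSketch, the helper totalsByCustomer, and the sample records are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: names and data are illustrative only.
object PipelineSketch {

  // Sums the larger orders per customer name.
  def totalsByCustomer(sc: SparkContext): Map[String, Int] = {
    // Made-up sample data: (customerId, name) and (customerId, amount).
    val customers = sc.parallelize(Seq((1, "ada"), (2, "grace")))
    val orders = sc.parallelize(Seq((1, 30), (2, 5), (1, 12), (2, 50)))

    val totals = orders
      .filter { case (_, amount) => amount >= 10 }           // drop small orders
      .join(customers)                                       // (id, (amount, name))
      .map { case (_, (amount, name)) => (name, amount) }    // re-key by name
      .reduceByKey(_ + _)                                    // sum per customer

    // Nothing has run yet; collect() is the action that triggers execution.
    totals.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PipelineSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(totalsByCustomer(sc))
    } finally {
      sc.stop()
    }
  }
}
```

The body of totalsByCustomer can also be pasted into the Scala REPL (spark-shell), where a SparkContext is already available as sc.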
Spark is proving to be a good platform on which to build analytics tools, too, and to this
end the Apache Spark project includes modules for machine learning (MLlib), graph
processing (GraphX), stream processing (Spark Streaming), and SQL (Spark SQL). These
modules are not covered in this chapter; the interested reader should refer to the Apache
Spark website.