Spark - Hadoop: The Definitive Guide

Chapter 19. Spark

Apache Spark is a cluster computing framework for large-scale data processing. Unlike most of the other processing frameworks discussed in this book, Spark does not use MapReduce as an execution engine; instead, it uses its own distributed runtime for executing work on a cluster. However, Spark has many parallels with MapReduce, in terms of both API and runtime, as we will see in this chapter. Spark is closely integrated with Hadoop: it can run on YARN and works with Hadoop file formats and storage backends like HDFS.

Spark is best known for its ability to keep large working datasets in memory between jobs. This capability allows Spark to outperform the equivalent MapReduce workflow (by an order of magnitude or more in some cases [128]), where datasets are always loaded from disk. Two styles of application that benefit greatly from Spark's processing model are iterative algorithms (where a function is applied to a dataset repeatedly until an exit condition is met) and interactive analysis (where a user issues a series of ad hoc exploratory queries on a dataset).
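To make the iterative case concrete, here is a minimal sketch in Scala, assuming a local Spark installation on the classpath. The object name `CacheSketch`, the method `iterativeMean`, and the toy "step halfway toward the mean" algorithm are illustrative assumptions, not something taken from this chapter. The point is only that the dataset is cached once and then reused by every job the loop launches.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example: names and the toy algorithm are illustrative only.
object CacheSketch {

  // Repeatedly moves a guess halfway toward the mean of `data`,
  // stopping when the step size falls below `tolerance` (the exit condition).
  def iterativeMean(data: RDD[Double], tolerance: Double = 1e-6): Double = {
    val cached = data.cache()   // ask Spark to keep the partitions in memory
    val n = cached.count()      // first job: computes the data and populates the cache
    require(n > 0, "dataset must not be empty")

    var guess = 0.0
    var step = Double.MaxValue
    while (math.abs(step) > tolerance) {
      val g = guess             // copy to a val so the closure captures a fixed value
      // Each pass is a separate job that reads the cached partitions.
      val gradient = cached.map(x => g - x).reduce(_ + _) / n
      step = 0.5 * gradient
      guess -= step
    }
    guess
  }

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM; a real cluster would use another master.
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 1000).map(_.toDouble)
      println(iterativeMean(data)) // a value very close to 500.5
    } finally {
      sc.stop()
    }
  }
}
```

Without the call to `cache()`, each iteration would recompute the dataset from its source, which is closer to what a chain of MapReduce jobs reading from disk would do.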

Even if you don't need in-memory caching, Spark is very attractive for a couple of other reasons: its DAG engine and its user experience. Unlike MapReduce, Spark's DAG engine can process arbitrary pipelines of operators and translate them into a single job for the user. Spark's user experience is also second to none, with a rich set of APIs for performing many common data processing tasks, such as joins. At the time of writing, Spark provides APIs in three languages: Scala, Java, and Python. We'll use the Scala API for most of the examples in this chapter, but they should be easy to translate to the other languages. Spark also comes with a REPL (read-eval-print loop) for both Scala and Python, which makes it quick and easy to explore datasets.
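As a rough illustration of a pipeline of operators becoming a single job, the following Scala sketch chains several transformations and ends with one action. The object `PipelineSketch`, the method `frequentWords`, and the sample input are hypothetical. The sketch also assumes a Spark release in which pair operations such as `reduceByKey` are available without extra imports; older releases may additionally need `import org.apache.spark.SparkContext._`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: names and input data are illustrative only.
object PipelineSketch {

  // Counts words and keeps those seen at least `minCount` times.
  // The transformations only describe the pipeline; the final collect()
  // is the action that submits it to Spark as one job.
  def frequentWords(sc: SparkContext, lines: Seq[String], minCount: Int): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.split("\\s+"))                          // split each line into words
      .filter(_.nonEmpty)                                // drop empty tokens
      .map(word => (word, 1))                            // pair each word with a count of 1
      .reduceByKey(_ + _)                                // sum the counts per word
      .filter { case (_, count) => count >= minCount }   // keep the frequent words
      .collect()                                         // action: runs the whole pipeline
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pipeline-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val result = frequentWords(sc, Seq("a b a", "b c", "a"), minCount = 2)
      println(result) // contains a -> 3 and b -> 2
    } finally {
      sc.stop()
    }
  }
}
```

The same chain can be typed line by line into the Scala REPL, where each intermediate dataset can be inspected before the next operator is added.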

Spark is proving to be a good platform on which to build analytics tools, too, and to this end the Apache Spark project includes modules for machine learning (MLlib), graph processing (GraphX), stream processing (Spark Streaming), and SQL (Spark SQL). These modules are not covered in this chapter; the interested reader should refer to the Apache Spark website.
