CHAPTER 7
Running on a Cluster
Introduction
Up to now, we've focused on learning Spark by using the Spark shell and examples
that run in Spark's local mode. One benefit of writing applications on Spark is the
ability to scale computation by adding more machines and running in cluster mode.
The good news is that writing applications for parallel cluster execution uses the same
API you've already learned in this book. The examples and applications you've written
so far will run on a cluster "out of the box." This is one of the benefits of Spark's
higher-level API: users can rapidly prototype applications on smaller datasets locally,
then run unmodified code on even very large clusters.
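To make this concrete, a sketch of what "unmodified code" means in practice: when submitting an application with `spark-submit`, typically only the `--master` flag changes between a local run and a cluster run, while the application itself stays the same. The JAR name, class name, and cluster setup below are hypothetical placeholders, not from this chapter:

```shell
# Run the application locally, using 4 worker threads on this machine.
# (my-app.jar and com.example.WordCount are placeholder names.)
spark-submit --master "local[4]" \
  --class com.example.WordCount \
  my-app.jar

# Run the identical, unmodified JAR on a YARN cluster instead:
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.WordCount \
  my-app.jar
```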
This chapter first explains the runtime architecture of a distributed Spark application,
then discusses options for running Spark in distributed clusters. Spark can run on a
wide variety of cluster managers (Hadoop YARN, Apache Mesos, and Spark's own
built-in Standalone cluster manager) in both on-premise and cloud deployments.
We'll discuss the trade-offs and configurations required for running in each case.
Along the way we'll also cover the “nuts and bolts” of scheduling, deploying, and
configuring a Spark application. After reading this chapter you'll have everything you
need to run a distributed Spark program. The following chapter will cover tuning and
debugging applications.
Spark Runtime Architecture
Before we dive into the specifics of running Spark on a cluster, it's helpful to understand
the architecture of Spark in distributed mode (illustrated in Figure 7-1).
In distributed mode, Spark uses a master/slave architecture with one central coordinator
and many distributed workers. The central coordinator is called the driver. The