CHAPTER 7
Running on a Cluster
Introduction
Up to now, we've focused on learning Spark by using the Spark shell and examples
that run in Spark's local mode. One benefit of writing applications on Spark is the
ability to scale computation by adding more machines and running in cluster mode.
The good news is that writing applications for parallel cluster execution uses the same
API you've already learned in this book. The examples and applications you've written
so far will run on a cluster "out of the box." This is one of the benefits of Spark's
higher-level API: users can rapidly prototype applications on smaller datasets locally,
then run unmodified code on even very large clusters.
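To make this concrete, a sketch of what "unmodified code" means in practice: when submitting an application with `spark-submit`, typically only the `--master` flag changes between a local run and a cluster run, while the application itself stays the same. The JAR name, class name, and cluster setup below are hypothetical placeholders, not from this chapter:

```shell
# Run the application locally, using 4 worker threads on this machine.
# (my-app.jar and com.example.WordCount are placeholder names.)
spark-submit --master "local[4]" \
  --class com.example.WordCount \
  my-app.jar

# Run the identical, unmodified JAR on a YARN cluster instead:
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.WordCount \
  my-app.jar
```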
This chapter first explains the runtime architecture of a distributed Spark application,
then discusses options for running Spark in distributed clusters. Spark can run on a
wide variety of cluster managers (Hadoop YARN, Apache Mesos, and Spark's own
built-in Standalone cluster manager) in both on-premise and cloud deployments.
We'll discuss the trade-offs and configurations required for running in each case.
Along the way we'll also cover the “nuts and bolts” of scheduling, deploying, and
configuring a Spark application. After reading this chapter you'll have everything you
need to run a distributed Spark program. The following chapter will cover tuning and
debugging applications.
Spark Runtime Architecture
Before we dive into the specifics of running Spark on a cluster, it's helpful to understand
the architecture of Spark in distributed mode (illustrated in Figure 7-1).
In distributed mode, Spark uses a master/slave architecture with one central coordinator
and many distributed workers. The central coordinator is called the driver. The