executors from the cluster manager, it may receive more or fewer executors depending on availability and contention in the cluster. Many cluster managers offer the
ability to define queues with different priorities or capacity limits, and Spark will then
submit jobs to such queues. See the documentation of your specific cluster manager
for more details.
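As a sketch of how queue-based submission looks in practice, the following directs a job to a named YARN capacity queue at submit time (the queue name and application file here are hypothetical; the queue itself must already be defined by the cluster administrator):

```shell
# Submit to a hypothetical YARN queue named "analytics".
# --queue maps the application onto a queue configured in YARN.
spark-submit \
  --master yarn \
  --queue analytics \
  my_app.py
```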
One special case of Spark applications is those that are long-lived, meaning that they
are never intended to terminate. An example of a long-lived Spark application is the
JDBC server bundled with Spark SQL. When the JDBC server launches it acquires a
set of executors from the cluster manager, then acts as a permanent gateway for SQL
queries submitted by users. Since this single application is scheduling work for multiple users, it needs a finer-grained mechanism to enforce sharing policies. Spark provides such a mechanism through configurable intra-application scheduling policies.
Spark's internal Fair Scheduler lets long-lived applications define queues for prioritizing scheduling of tasks. A detailed review of these is beyond the scope of this topic;
the official documentation on the Fair Scheduler provides a good reference.
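As a minimal sketch of fair scheduling, the configuration below defines two pools with different weights and enables the Fair Scheduler at submit time. The pool names, weights, and file path are illustrative assumptions; the configuration keys (`spark.scheduler.mode`, `spark.scheduler.allocation.file`) and the XML elements are the ones Spark's scheduler documentation describes:

```shell
# Write a hypothetical pool-definition file. "production" gets twice
# the weight of "adhoc" and a guaranteed minimum share of 4 cores.
cat > /tmp/fairscheduler.xml <<'EOF'
<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>4</minShare>
  </pool>
  <pool name="adhoc">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
  </pool>
</allocations>
EOF

# Enable fair scheduling and point Spark at the pool definitions.
spark-submit \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/tmp/fairscheduler.xml \
  my_app.py
```

Within the application, individual jobs can then be assigned to a pool at runtime, for example with `sc.setLocalProperty("spark.scheduler.pool", "production")`.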
Cluster Managers
Spark can run over a variety of cluster managers to access the machines in a cluster. If
you only want to run Spark by itself on a set of machines, the built-in Standalone
mode is the easiest way to deploy it. However, if you have a cluster that you'd like to
share with other distributed applications (e.g., both Spark jobs and Hadoop MapReduce jobs), Spark can also run over two popular cluster managers: Hadoop YARN and Apache Mesos. Finally, for deploying on Amazon EC2, Spark comes with built-in scripts that launch a Standalone cluster and various supporting services. In this
section, we'll cover how to run Spark in each of these environments.
Standalone Cluster Manager
Spark's Standalone manager offers a simple way to run applications on a cluster. It
consists of a master and multiple workers, each with a configured amount of memory and CPU cores. When you submit an application, you can choose how much memory its executors will use, as well as the total number of cores across all executors.
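As an illustration of those two knobs, a submission against a Standalone master might look like the following (the master hostname and application file are placeholders; the flags are standard `spark-submit` options):

```shell
# Submit to a Standalone master, requesting 2 GB of memory per
# executor and a total of 8 cores spread across all executors.
spark-submit \
  --master spark://master-host:7077 \
  --executor-memory 2g \
  --total-executor-cores 8 \
  my_app.py
```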
Launching the Standalone cluster manager
You can start the Standalone cluster manager either by starting a master and workers
by hand, or by using launch scripts in Spark's sbin directory. The launch scripts are
the simplest option to use, but require SSH access between your machines and are
currently (as of Spark 1.1) available only on Mac OS X and Linux. We will cover these
first, then show how to launch a cluster manually on other platforms.
To use the cluster launch scripts, follow these steps:
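A rough sketch of that workflow, assuming a Spark installation at `$SPARK_HOME` and hypothetical worker hostnames, is:

```shell
# On the master machine, list the worker hostnames (one per line)
# in the slaves file read by the launch scripts.
echo "worker1.example.com" >> $SPARK_HOME/conf/slaves
echo "worker2.example.com" >> $SPARK_HOME/conf/slaves

# Start the master and, over SSH, a worker on each listed machine.
$SPARK_HOME/sbin/start-all.sh

# To shut the whole cluster down later:
$SPARK_HOME/sbin/stop-all.sh
```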