executors from the cluster manager, it may receive more or fewer executors depending on availability and contention in the cluster. Many cluster managers offer the
ability to define queues with different priorities or capacity limits, and Spark will then
submit jobs to such queues. See the documentation of your specific cluster manager
for more details.
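As a sketch of how queue-based submission looks in practice, the following directs a job to a named YARN capacity queue at submit time (the queue name and application file here are hypothetical; the queue itself must already be defined by the cluster administrator):

```shell
# Submit to a hypothetical YARN queue named "analytics".
# --queue maps the application onto a queue configured in YARN.
spark-submit \
  --master yarn \
  --queue analytics \
  my_app.py
```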
One special case of Spark applications is those that are long-lived, meaning that they
are never intended to terminate. An example of a long-lived Spark application is the
JDBC server bundled with Spark SQL. When the JDBC server launches it acquires a
set of executors from the cluster manager, then acts as a permanent gateway for SQL
queries submitted by users. Since this single application is scheduling work for multiple users, it needs a finer-grained mechanism to enforce sharing policies. Spark provides such a mechanism through configurable intra-application scheduling policies.
Spark's internal Fair Scheduler lets long-lived applications define queues for prioritizing scheduling of tasks. A detailed review of these is beyond the scope of this topic;
the official documentation on the Fair Scheduler provides a good reference.
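As a minimal sketch of fair scheduling, the configuration below defines two pools with different weights and enables the Fair Scheduler at submit time. The pool names, weights, and file path are illustrative assumptions; the configuration keys (`spark.scheduler.mode`, `spark.scheduler.allocation.file`) and the XML elements are the ones Spark's scheduler documentation describes:

```shell
# Write a hypothetical pool-definition file. "production" gets twice
# the weight of "adhoc" and a guaranteed minimum share of 4 cores.
cat > /tmp/fairscheduler.xml <<'EOF'
<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>4</minShare>
  </pool>
  <pool name="adhoc">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
  </pool>
</allocations>
EOF

# Enable fair scheduling and point Spark at the pool definitions.
spark-submit \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/tmp/fairscheduler.xml \
  my_app.py
```

Within the application, individual jobs can then be assigned to a pool at runtime, for example with `sc.setLocalProperty("spark.scheduler.pool", "production")`.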
Cluster Managers
Spark can run over a variety of cluster managers to access the machines in a cluster. If
you only want to run Spark by itself on a set of machines, the built-in Standalone
mode is the easiest way to deploy it. However, if you have a cluster that you'd like to
share with other distributed applications (e.g., both Spark jobs and Hadoop MapReduce jobs), Spark can also run over two popular cluster managers: Hadoop YARN and Apache Mesos. Finally, for deploying on Amazon EC2, Spark comes with built-in scripts that launch a Standalone cluster and various supporting services. In this
section, we'll cover how to run Spark in each of these environments.
Standalone Cluster Manager
Spark's Standalone manager offers a simple way to run applications on a cluster. It
consists of a master and multiple workers, each with a configured amount of memory and CPU cores. When you submit an application, you can choose how much memory its executors will use, as well as the total number of cores across all executors.
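As an illustration of those two knobs, a submission against a Standalone master might look like the following (the master hostname and application file are placeholders; the flags are standard `spark-submit` options):

```shell
# Submit to a Standalone master, requesting 2 GB of memory per
# executor and a total of 8 cores spread across all executors.
spark-submit \
  --master spark://master-host:7077 \
  --executor-memory 2g \
  --total-executor-cores 8 \
  my_app.py
```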
Launching the Standalone cluster manager
You can start the Standalone cluster manager either by starting a master and workers
by hand, or by using launch scripts in Spark's sbin directory. The launch scripts are
the simplest option to use, but require SSH access between your machines and are
currently (as of Spark 1.1) available only on Mac OS X and Linux. We will cover these
first, then show how to launch a cluster manually on other platforms.
To use the cluster launch scripts, follow these steps:
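A rough sketch of that workflow, assuming a Spark installation at `$SPARK_HOME` and hypothetical worker hostnames, is:

```shell
# On the master machine, list the worker hostnames (one per line)
# in the slaves file read by the launch scripts.
echo "worker1.example.com" >> $SPARK_HOME/conf/slaves
echo "worker2.example.com" >> $SPARK_HOME/conf/slaves

# Start the master and, over SSH, a worker on each listed machine.
$SPARK_HOME/sbin/start-all.sh

# To shut the whole cluster down later:
$SPARK_HOME/sbin/stop-all.sh
```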