High availability
When running in production settings, you will want your Standalone cluster to be
available to accept applications even if individual nodes in your cluster go down. Out
of the box, the Standalone mode will gracefully support the failure of worker nodes. If
you also want the master of the cluster to be highly available, Spark supports using
Apache ZooKeeper (a distributed coordination system) to keep multiple standby
masters and switch to a new one when the active master fails. Configuring a
Standalone cluster with ZooKeeper is beyond the scope of this topic, but it is
described in the official Spark documentation.
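As a rough sketch of what that configuration involves (the ZooKeeper hosts and directory below are placeholder values), the recovery settings are passed to each master process, for example through conf/spark-env.sh:
# In conf/spark-env.sh on each master node; zk1, zk2, zk3 are placeholders
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"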
Hadoop YARN
YARN is a cluster manager introduced in Hadoop 2.0 that allows diverse data
processing frameworks to run on a shared resource pool, and is typically installed on the
same nodes as the Hadoop filesystem (HDFS). Running Spark on YARN in these
environments is useful because it lets Spark access HDFS data quickly, on the same
nodes where the data is stored.
Using YARN in Spark is straightforward: you set an environment variable that points
to your Hadoop configuration directory, then submit jobs to a special master URL
with spark-submit.
The first step is to figure out your Hadoop configuration directory and set it as the
environment variable HADOOP_CONF_DIR. This is the directory that contains
yarn-site.xml and other config files; typically, it is HADOOP_HOME/conf if you installed
Hadoop in HADOOP_HOME, or a system path like /etc/hadoop/conf. Then, submit
your application as follows:
export HADOOP_CONF_DIR="..."
spark-submit --master yarn yourapp
As with the Standalone cluster manager, there are two modes to connect your
application to the cluster: client mode, where the driver program for your application runs
on the machine that you submitted the application from (e.g., your laptop), and
cluster mode, where the driver also runs inside a YARN container. You can set the mode
to use via the --deploy-mode argument to spark-submit.
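For example, following the same pattern as the earlier submission:
# Client mode (the default): the driver runs on the submitting machine
spark-submit --master yarn --deploy-mode client yourapp
# Cluster mode: the driver runs inside a YARN container on the cluster
spark-submit --master yarn --deploy-mode cluster yourapp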
Spark's interactive shell and pyspark both work on YARN as well; simply set
HADOOP_CONF_DIR and pass --master yarn to these applications. Note that these will
run only in client mode since they need to obtain input from the user.
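For instance, assuming HADOOP_CONF_DIR is already exported as shown above:
spark-shell --master yarn   # Scala shell, runs in client mode
pyspark --master yarn       # Python shell, runs in client mode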
Configuring resource usage
When running on YARN, Spark applications use a fixed number of executors, which
you can set via the --num-executors flag to spark-submit, spark-shell, and so on.
By default, this is only two, so you will likely need to increase it. You can also set the
memory allocated to each executor via --executor-memory and the number of cores
each executor claims from YARN via --executor-cores.
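Putting these flags together, a submission might look like the following (the resource values are illustrative only):
# Request 10 executors, each with 2 cores and 4 GB of memory (example values)
spark-submit --master yarn \
  --num-executors 10 \
  --executor-cores 2 \
  --executor-memory 4g \
  yourapp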