High availability
When running in production settings, you will want your Standalone cluster to be
available to accept applications even if individual nodes in your cluster go down. Out
of the box, the Standalone mode will gracefully support the failure of worker nodes. If
you also want the master of the cluster to be highly available, Spark supports using
Apache ZooKeeper (a distributed coordination system) to keep multiple standby
masters and switch to a new one when the active master fails. Configuring a
Standalone cluster with ZooKeeper is beyond the scope of this topic, but it is
described in the official Spark documentation.
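As a rough sketch of what that configuration involves (the ZooKeeper hosts and directory below are placeholder values), the recovery settings are passed to each master process, for example through conf/spark-env.sh:
# In conf/spark-env.sh on each master node; zk1, zk2, zk3 are placeholders
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"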
Hadoop YARN
YARN is a cluster manager introduced in Hadoop 2.0 that allows diverse data
processing frameworks to run on a shared resource pool, and is typically installed on the
same nodes as the Hadoop filesystem (HDFS). Running Spark on YARN in these
environments is useful because it lets Spark access HDFS data quickly, on the same
nodes where the data is stored.
Using YARN in Spark is straightforward: you set an environment variable that points
to your Hadoop configuration directory, then submit jobs to a special master URL
with spark-submit.
The first step is to figure out your Hadoop configuration directory and set it as the
environment variable HADOOP_CONF_DIR. This is the directory that contains
yarn-site.xml and other config files; typically, it is HADOOP_HOME/conf if you installed
Hadoop in HADOOP_HOME, or a system path like /etc/hadoop/conf. Then, submit
your application as follows:
export HADOOP_CONF_DIR="..."
spark-submit --master yarn yourapp
As with the Standalone cluster manager, there are two modes to connect your
application to the cluster: client mode, where the driver program for your application runs
on the machine that you submitted the application from (e.g., your laptop), and
cluster mode, where the driver also runs inside a YARN container. You can set the mode
to use via the --deploy-mode argument to spark-submit.
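For example, following the same pattern as the earlier submission:
# Client mode (the default): the driver runs on the submitting machine
spark-submit --master yarn --deploy-mode client yourapp
# Cluster mode: the driver runs inside a YARN container on the cluster
spark-submit --master yarn --deploy-mode cluster yourapp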
Spark's interactive shell and pyspark both work on YARN as well; simply set
HADOOP_CONF_DIR and pass --master yarn to these applications. Note that these will
run only in client mode since they need to obtain input from the user.
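For instance, assuming HADOOP_CONF_DIR is already exported as shown above:
spark-shell --master yarn   # Scala shell, runs in client mode
pyspark --master yarn       # Python shell, runs in client mode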
Configuring resource usage
When running on YARN, Spark applications use a fixed number of executors, which
you can set via the --num-executors flag to spark-submit, spark-shell, and so on.
By default, this is only two, so you will likely need to increase it. You can also set the
memory allocated to each executor via --executor-memory and the number of cores
each executor claims from YARN via --executor-cores.
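Putting these flags together, a submission might look like the following (the resource values are illustrative only):
# Request 10 executors, each with 2 cores and 4 GB of memory (example values)
spark-submit --master yarn \
  --num-executors 10 \
  --executor-cores 2 \
  --executor-memory 4g \
  yourapp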