Alternatively, you can find the master's hostname by running:
./spark-ec2 get-master mycluster
Then SSH into it yourself using ssh -i keypair.pem root@masternode.
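If you prefer to script this, here is a minimal sketch combining the two steps; it assumes get-master prints the hostname on the last line of its output, which you should verify against your version of the script:
# capture the master's hostname, then SSH in
MASTER=$(./spark-ec2 get-master mycluster | tail -n 1)
ssh -i keypair.pem root@$MASTER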
Once you are in the cluster, you can use the Spark installation in /root/spark to run programs. This is a Standalone cluster installation, with the master URL spark://masternode:7077. If you launch an application with spark-submit from the cluster, it will automatically come configured to submit to this master. You can view the cluster's web UI at http://masternode:8080.
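For example, submitting the SparkPi example that ships with Spark might look like the following; the examples JAR path is illustrative, since its exact name varies with the Spark and Hadoop versions installed by spark-ec2:
# run the bundled SparkPi example against the standalone master
# (the examples JAR name varies by version; adjust to match yours)
/root/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://masternode:7077 \
  /root/spark/lib/spark-examples-*.jar 100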
Note that only programs launched from within the cluster can submit jobs to it with spark-submit; for security, the firewall rules prevent external hosts from doing so. To run a prepackaged application on the cluster, first copy it over using SCP:
scp -i mykeypair.pem app.jar root@masternode:~
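Then SSH into the master and launch the application with spark-submit; a sketch, where com.example.MyApp stands in for your application's actual main class:
# on the master node: submit the uploaded JAR to the cluster
# (com.example.MyApp is a placeholder for your main class)
/root/spark/bin/spark-submit \
  --class com.example.MyApp \
  --master spark://masternode:7077 \
  ~/app.jar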
Destroying a cluster
To destroy a cluster launched by spark-ec2, run:
./spark-ec2 destroy mycluster
This will terminate all the instances associated with the cluster (i.e., all instances in its two security groups, mycluster-master and mycluster-slaves).
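If you want to confirm that nothing is left running, one way is to query EC2 for instances still attached to those security groups; a sketch using the AWS CLI, assuming it is installed and configured for the cluster's region:
# list instances in the cluster's security groups and their states
aws ec2 describe-instances \
  --filters "Name=instance.group-name,Values=mycluster-master,mycluster-slaves" \
  --query "Reservations[].Instances[].[InstanceId,State.Name]" \
  --output table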
Pausing and restarting clusters
In addition to outright terminating clusters, spark-ec2 lets you stop the Amazon instances running your cluster and then start them again later. Stopping instances shuts them down and makes them lose all data on the "ephemeral" disks, which are configured with an installation of HDFS for spark-ec2 (see "Storage on the cluster" on page 138). However, the stopped instances retain all data in their root directory (e.g., any files you uploaded there), so you'll be able to quickly return to work.
To stop a cluster, use:
./spark-ec2 stop mycluster
Then, later, to start it up again:
./spark-ec2 -k mykeypair -i mykeypair.pem start mycluster
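Keep in mind that stopping and starting EC2 instances generally assigns them new public hostnames (unless you have attached Elastic IPs), so after a restart you will usually need to look up the master again:
./spark-ec2 get-master mycluster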