While the Spark EC2 script does not provide commands to resize
clusters, you can resize them by adding machines to or removing
machines from the mycluster-slaves security group. To add machines,
first stop the cluster, then use the AWS Management Console to
right-click one of the slave nodes and select “Launch more like
this.” This creates more instances in the same security group. Then
use spark-ec2 start to restart your cluster. To remove machines,
simply terminate them from the AWS console (though beware that this
destroys data on the cluster's HDFS installations).
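As a concrete sketch, assuming a cluster named mycluster launched with a key pair called mykey (both names are placeholders), the stop/start cycle from the spark-ec2 directory looks like this:

    # Stop the cluster before adding nodes
    ./spark-ec2 stop mycluster

    # ...add instances via "Launch more like this" in the AWS console...

    # Restart the cluster so Spark picks up the new slave nodes
    ./spark-ec2 -k mykey -i mykey.pem start mycluster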
Storage on the cluster
Spark EC2 clusters come configured with two installations of the Hadoop filesystem
that you can use for scratch space. This can be handy for saving datasets in storage
that is faster to access than Amazon S3. The two installations are:
• An “ephemeral” HDFS installation using the ephemeral drives on the nodes.
Most Amazon instance types come with a substantial amount of local space
attached on “ephemeral” drives that go away if you stop the instance. This
installation of HDFS uses that space, giving you a significant amount of scratch
space, but it loses all data when you stop and restart the EC2 cluster. It is
installed in the /root/ephemeral-hdfs directory on the nodes, where you can use
the bin/hdfs command to access and list files (see the example commands after
this list). You can also view the web UI and HDFS URL for it at
http://masternode:50070.
• A “persistent” HDFS installation on the root volumes of the nodes. This
installation persists data even through cluster restarts, but is generally smaller
and slower to access than the ephemeral one. It is good for medium-sized datasets
that you do not wish to download multiple times. It is installed in
/root/persistent-hdfs, and you can view the web UI and HDFS URL for it at
http://masternode:60070.
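As a sketch of how you might use the two installations from the master node (the file and directory names here are made up for illustration):

    # List the contents of the ephemeral HDFS scratch space
    /root/ephemeral-hdfs/bin/hdfs dfs -ls /

    # Copy a local file into the ephemeral HDFS for fast scratch access
    /root/ephemeral-hdfs/bin/hdfs dfs -put /root/mydata.csv /mydata.csv

    # Store a medium-sized dataset in the persistent HDFS so it survives restarts
    /root/persistent-hdfs/bin/hdfs dfs -put /root/mydata.csv /mydata.csv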
Apart from these, you will most likely be accessing data from Amazon S3, which you
can do using the s3n:// URI scheme in Spark. Refer to “Amazon S3” on page 90 for
details.
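For instance, a minimal PySpark sketch that reads text data from S3 might look like the following; the bucket name and path are placeholders, and your AWS credentials must be available (for example via the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables):

    # Minimal sketch: count lines in a text dataset stored on S3.
    # "my-bucket" and "logs/*.txt" are hypothetical names.
    from pyspark import SparkContext

    sc = SparkContext(appName="S3Read")
    lines = sc.textFile("s3n://my-bucket/logs/*.txt")
    print(lines.count())
    sc.stop()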
Which Cluster Manager to Use?
The cluster managers supported in Spark offer a variety of options for deploying
applications. If you are starting a new deployment and looking to choose a cluster
manager, we recommend the following guidelines:
• Start with a Standalone cluster. Standalone mode is the easiest to set up and
provides almost all the same features as the other cluster managers if you are
running only Spark (a submission sketch follows below).
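As a sketch, submitting an application to a Standalone cluster only requires pointing spark-submit at the master's URL; the host name and script name below are placeholders:

    # Submit a PySpark application to a Standalone cluster master
    ./bin/spark-submit --master spark://masternode:7077 my_script.py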