While the Spark EC2 script does not provide commands to resize
clusters, you can resize them by adding machines to or removing
machines from the mycluster-slaves security group. To add machines,
first stop the cluster, then use the AWS Management Console to
right-click one of the slave nodes and select “Launch more like
this.” This creates more instances in the same security group. Then
use spark-ec2 start to restart your cluster. To remove machines,
simply terminate them from the AWS console (though beware that this
destroys data on the cluster's HDFS installations).
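As a concrete sketch, assuming a cluster named mycluster launched with a key pair called mykey (both names are placeholders), the stop/start cycle from the spark-ec2 directory looks like this:

    # Stop the cluster before adding nodes
    ./spark-ec2 stop mycluster

    # ...add instances via "Launch more like this" in the AWS console...

    # Restart the cluster so Spark picks up the new slave nodes
    ./spark-ec2 -k mykey -i mykey.pem start mycluster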
Storage on the cluster
Spark EC2 clusters come configured with two installations of the Hadoop filesystem
that you can use for scratch space. This can be handy for saving datasets in storage
that is faster to access than Amazon S3. The two installations are:
• An “ephemeral” HDFS installation using the ephemeral drives on the nodes.
Most Amazon instance types come with a substantial amount of local space
attached on “ephemeral” drives that go away if you stop the instance. This
installation of HDFS uses that space, giving you a significant amount of scratch
space, but it loses all data when you stop and restart the EC2 cluster. It is
installed in the /root/ephemeral-hdfs directory on the nodes, where you can use
the bin/hdfs command to access and list files (see the example commands after
this list). You can also view the web UI and HDFS URL for it at
http://masternode:50070.
• A “persistent” HDFS installation on the root volumes of the nodes. This
installation persists data even through cluster restarts, but is generally smaller
and slower to access than the ephemeral one. It is good for medium-sized datasets
that you do not wish to download multiple times. It is installed in
/root/persistent-hdfs, and you can view the web UI and HDFS URL for it at
http://masternode:60070.
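As a sketch of how you might use the two installations from the master node (the file and directory names here are made up for illustration):

    # List the contents of the ephemeral HDFS scratch space
    /root/ephemeral-hdfs/bin/hdfs dfs -ls /

    # Copy a local file into the ephemeral HDFS for fast scratch access
    /root/ephemeral-hdfs/bin/hdfs dfs -put /root/mydata.csv /mydata.csv

    # Store a medium-sized dataset in the persistent HDFS so it survives restarts
    /root/persistent-hdfs/bin/hdfs dfs -put /root/mydata.csv /mydata.csv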
Apart from these, you will most likely be accessing data from Amazon S3, which you
can do using the s3n:// URI scheme in Spark. Refer to “Amazon S3” on page 90 for
details.
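For instance, a minimal PySpark sketch that reads text data from S3 might look like the following; the bucket name and path are placeholders, and your AWS credentials must be available (for example via the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables):

    # Minimal sketch: count lines in a text dataset stored on S3.
    # "my-bucket" and "logs/*.txt" are hypothetical names.
    from pyspark import SparkContext

    sc = SparkContext(appName="S3Read")
    lines = sc.textFile("s3n://my-bucket/logs/*.txt")
    print(lines.count())
    sc.stop()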
Which Cluster Manager to Use?
The cluster managers supported in Spark offer a variety of options for deploying
applications. If you are starting a new deployment and looking to choose a cluster
manager, we recommend the following guidelines:
• Start with a Standalone cluster. Standalone mode is the easiest to set up and
provides almost all the same features as the other cluster managers if you are
running only Spark (a submission sketch follows below).
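As a sketch, submitting an application to a Standalone cluster only requires pointing spark-submit at the master's URL; the host name and script name below are placeholders:

    # Submit a PySpark application to a Standalone cluster master
    ./bin/spark-submit --master spark://masternode:7077 my_script.py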