If your file isn't already on all nodes in the cluster, you can load it locally on the driver without going through Spark and then call parallelize to distribute the contents to workers. This approach can be slow, however, so we recommend putting your files in a shared filesystem like HDFS, NFS, or S3.
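A minimal sketch of that pattern in Scala follows; the file name input.txt and the application name are hypothetical, and the SparkContext setup is only needed outside the Spark shell.

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.io.Source

    val conf = new SparkConf().setAppName("LocalLoad")  // hypothetical app name; master is set by spark-submit
    val sc = new SparkContext(conf)

    // Read the file on the driver only, then distribute its lines to the workers.
    val lines = Source.fromFile("input.txt").getLines().toList  // hypothetical local path
    val distributed = sc.parallelize(lines)
    println(distributed.count())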
Amazon S3
Amazon S3 is an increasingly popular option for storing large amounts of data. S3 is
especially fast when your compute nodes are located inside of Amazon EC2, but can
easily have much worse performance if you have to go over the public Internet.
To access S3 in Spark, you should first set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables to your S3 credentials. You can create these credentials from the Amazon Web Services console. Then pass a path starting with s3n:// to Spark's file input methods, of the form s3n://bucket/path-within-bucket. As with all the other filesystems, Spark supports wildcard paths for S3, such as s3n://bucket/my-files/*.txt.
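For example, with those two environment variables exported in the shell that launches Spark, reading text files from a hypothetical bucket might look like the following sketch (sc is an existing SparkContext):

    // "my-bucket" and the my-files prefix are hypothetical; the wildcard matches every .txt object.
    val s3Lines = sc.textFile("s3n://my-bucket/my-files/*.txt")
    s3Lines.take(5).foreach(println)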
If you get an S3 access permissions error from Amazon, make sure that the account
for which you specified an access key has both “read” and “list” permissions on the
bucket. Spark needs to be able to list the objects in the bucket to identify the ones you
want to read.
HDFS
The Hadoop Distributed File System (HDFS) is a popular distributed filesystem with
which Spark works well. HDFS is designed to work on commodity hardware and be
resilient to node failure while providing high data throughput. Spark and HDFS can
be collocated on the same machines, and Spark can take advantage of this data locality to avoid network overhead.
Using Spark with HDFS is as simple as specifying hdfs://master:port/path for
your input and output.
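As a sketch, assuming a hypothetical NameNode at namenode on port 9000 and made-up paths, reading and writing HDFS data looks the same as any other textFile call:

    // Substitute your cluster's NameNode host, port, and paths.
    val events = sc.textFile("hdfs://namenode:9000/user/data/events.log")
    events.filter(_.contains("ERROR")).saveAsTextFile("hdfs://namenode:9000/user/data/errors")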
The HDFS protocol changes across Hadoop versions, so if you run a version of Spark that is compiled for a different version, it will fail. By default Spark is built against Hadoop 1.0.4. If you build from source, you can specify SPARK_HADOOP_VERSION= as an environment variable to build against a different version; or you can download a different precompiled version of Spark. You can determine the installed Hadoop version by running hadoop version.