If your file isn't already on all nodes in the cluster, you can load it locally on the driver without going through Spark and then call parallelize to distribute the contents to workers. This approach can be slow, however, so we recommend putting your files in a shared filesystem like HDFS, NFS, or S3.
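A minimal sketch of that pattern in Scala follows; the file name input.txt and the application name are hypothetical, and the SparkContext setup is only needed outside the Spark shell.

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.io.Source

    val conf = new SparkConf().setAppName("LocalLoad")  // hypothetical app name; master is set by spark-submit
    val sc = new SparkContext(conf)

    // Read the file on the driver only, then distribute its lines to the workers.
    val lines = Source.fromFile("input.txt").getLines().toList  // hypothetical local path
    val distributed = sc.parallelize(lines)
    println(distributed.count())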
Amazon S3
Amazon S3 is an increasingly popular option for storing large amounts of data. S3 is
especially fast when your compute nodes are located inside of Amazon EC2, but can
easily have much worse performance if you have to go over the public Internet.
To access S3 in Spark, you should first set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables to your S3 credentials. You can create these credentials from the Amazon Web Services console. Then pass a path starting with s3n:// to Spark's file input methods, of the form s3n://bucket/path-within-bucket. As with all the other filesystems, Spark supports wildcard paths for S3, such as s3n://bucket/my-files/*.txt.
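For example, with those two environment variables exported in the shell that launches Spark, reading text files from a hypothetical bucket might look like the following sketch (sc is an existing SparkContext):

    // "my-bucket" and the my-files prefix are hypothetical; the wildcard matches every .txt object.
    val s3Lines = sc.textFile("s3n://my-bucket/my-files/*.txt")
    s3Lines.take(5).foreach(println)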
If you get an S3 access permissions error from Amazon, make sure that the account
for which you specified an access key has both “read” and “list” permissions on the
bucket. Spark needs to be able to list the objects in the bucket to identify the ones you
want to read.
HDFS
The Hadoop Distributed File System (HDFS) is a popular distributed filesystem with
which Spark works well. HDFS is designed to work on commodity hardware and be
resilient to node failure while providing high data throughput. Spark and HDFS can
be collocated on the same machines, and Spark can take advantage of this data locality to avoid network overhead.
Using Spark with HDFS is as simple as specifying hdfs://master:port/path for
your input and output.
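As a sketch, assuming a hypothetical NameNode at namenode on port 9000 and made-up paths, reading and writing HDFS data looks the same as any other textFile call:

    // Substitute your cluster's NameNode host, port, and paths.
    val events = sc.textFile("hdfs://namenode:9000/user/data/events.log")
    events.filter(_.contains("ERROR")).saveAsTextFile("hdfs://namenode:9000/user/data/errors")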
The HDFS protocol changes across Hadoop versions, so if you run a version of Spark that is compiled for a different version, it will fail. By default Spark is built against Hadoop 1.0.4. If you build from source, you can specify SPARK_HADOOP_VERSION= as an environment variable to build against a different version; or you can download a different precompiled version of Spark. You can determine the installed Hadoop version by running hadoop version.