Resilient Distributed Datasets
The core of Spark is a concept called the Resilient Distributed Dataset (RDD). An RDD is a collection of "records" (strictly speaking, objects of some type) that is distributed or partitioned across many nodes in a cluster (for the purposes of the Spark local mode, the single multithreaded process can be thought of in the same way). An RDD in Spark is fault-tolerant; this means that if a given node or task fails (for some reason other than erroneous user code, such as hardware failure, loss of communication, and so on), the RDD can be reconstructed automatically on the remaining nodes and the job will still complete.
Creating RDDs
RDDs can be created from existing collections, for example, in the Scala Spark shell that
you launched earlier:
val collection = List("a", "b", "c", "d", "e")
val rddFromCollection = sc.parallelize(collection)
RDDs can also be created from Hadoop-based input sources, including the local filesystem, HDFS, and Amazon S3. A Hadoop-based RDD can utilize any input format that implements the Hadoop InputFormat interface, including text files, other standard Hadoop formats, HBase, Cassandra, and many more. The following code is an example of creating an RDD from a text file located on the local filesystem:
val rddFromTextFile = sc.textFile("LICENSE")
The preceding textFile method returns an RDD where each record is a String object
that represents one line of the text file.
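As a quick sanity check, we can peek at the first record of the RDD. This is a minimal sketch that assumes the LICENSE file from the preceding snippet exists in the current working directory (first is one of the action operations described in the next section):

val firstLine = rddFromTextFile.first()  // returns the first line of the file as a String
println(firstLine)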
Spark operations
Once we have created an RDD, we have a distributed collection of records that we can manipulate. In Spark's programming model, operations are split into transformations and actions. Generally speaking, a transformation operation applies some function to all the records in the dataset, changing the records in some way. An action typically runs some computation or aggregation operation and returns the result to the driver program where SparkContext is running.
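For example, the following minimal sketch reuses rddFromTextFile from earlier; the specific operations (computing line lengths) are purely illustrative and not taken from the original text. The map call is a transformation that is only applied lazily, while count and sum are actions that trigger computation and return results to the driver:

val lineLengths = rddFromTextFile.map(line => line.length)  // transformation: per-record function, evaluated lazily
val numLines = rddFromTextFile.count()                      // action: returns the number of records to the driver
val totalChars = lineLengths.sum()                          // action: aggregates the line lengths and returns the total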