Resilient Distributed Datasets
The core of Spark is a concept called the Resilient Distributed Dataset (RDD). An RDD is a collection of "records" (strictly speaking, objects of some type) that is distributed or partitioned across many nodes in a cluster (for the purposes of the Spark local mode, the single multithreaded process can be thought of in the same way). An RDD in Spark is fault-tolerant; this means that if a given node or task fails (for some reason other than erroneous user code, such as hardware failure, loss of communication, and so on), the RDD can be reconstructed automatically on the remaining nodes and the job will still complete.
Creating RDDs
RDDs can be created from existing collections, for example, in the Scala Spark shell that
you launched earlier:
val collection = List("a", "b", "c", "d", "e")
val rddFromCollection = sc.parallelize(collection)
RDDs can also be created from Hadoop-based input sources, including the local filesystem, HDFS, and Amazon S3. A Hadoop-based RDD can utilize any input format that implements the Hadoop InputFormat interface, including text files, other standard Hadoop formats, HBase, Cassandra, and many more. The following code is an example of creating an RDD from a text file located on the local filesystem:
val rddFromTextFile = sc.textFile("LICENSE")
The preceding textFile method returns an RDD where each record is a String object
that represents one line of the text file.
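As a quick sanity check, we can peek at the first record of the RDD. This is a minimal sketch that assumes the LICENSE file from the preceding snippet exists in the current working directory (first is one of the action operations described in the next section):

val firstLine = rddFromTextFile.first()  // returns the first line of the file as a String
println(firstLine)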
Spark operations
Once we have created an RDD, we have a distributed collection of records that we can manipulate. In Spark's programming model, operations are split into transformations and actions. Generally speaking, a transformation operation applies some function to all the records in the dataset, changing the records in some way. An action typically runs some computation or aggregation operation and returns the result to the driver program where SparkContext is running.
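For example, the following minimal sketch reuses rddFromTextFile from earlier; the specific operations (computing line lengths) are purely illustrative and not taken from the original text. The map call is a transformation that is only applied lazily, while count and sum are actions that trigger computation and return results to the driver:

val lineLengths = rddFromTextFile.map(line => line.length)  // transformation: per-record function, evaluated lazily
val numLines = rddFromTextFile.count()                      // action: returns the number of records to the driver
val totalChars = lineLengths.sum()                          // action: aggregates the line lengths and returns the total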