An Example
To introduce Spark, let's run an interactive session using spark-shell, which is a Scala REPL with a few Spark additions. Start up the shell with the following:

% spark-shell
Spark context available as sc.

scala>
From the console output, we can see that the shell has created a Scala variable, sc, to store the SparkContext instance. This is our entry point to Spark, and it allows us to load a text file as follows:

scala> val lines = sc.textFile("input/ncdc/micro-tab/sample.txt")
lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
The lines variable is a reference to a Resilient Distributed Dataset, abbreviated to RDD, which is the central abstraction in Spark: a read-only collection of objects that is partitioned across multiple machines in a cluster. In a typical Spark program, one or more RDDs are loaded as input and, through a series of transformations, are turned into a set of target RDDs, which have an action performed on them (such as computing a result or writing them to persistent storage). The term "resilient" in "Resilient Distributed Dataset" refers to the fact that Spark can automatically reconstruct a lost partition by recomputing it from the RDDs it was computed from.
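The load, transform, action shape of a typical Spark program can be sketched with plain Scala collections standing in for RDDs. This is only a loose analogy (real RDDs are partitioned across a cluster, and the sample line contents here are made up for illustration), but the pipeline structure is the same:

```scala
object PipelineShape {
  def main(args: Array[String]): Unit = {
    // "Load": an input dataset (in Spark this would come from sc.textFile(...)).
    val input = List("1950\t22\t1", "1949\t111\t1", "1949\t78\t1")
    // "Transformations": each step derives a new dataset from an existing one.
    val records = input.map(_.split("\t"))
    val temps = records.map(r => r(1).toInt)
    // "Action": produce a concrete result, ending the pipeline.
    val max = temps.max
    println(max) // prints 111
  }
}
```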
NOTE
Loading an RDD or performing a transformation on one does not trigger any data processing; it merely creates a plan for performing a computation. The computation is only triggered when an action (like foreach()) is performed on an RDD.
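This kind of laziness can be observed without Spark at all, using Scala's LazyList (Scala 2.13+) as a rough analogy. In the hypothetical sketch below, the map transformation records how often its function runs: the count stays at zero until foreach, the "action", forces the pipeline.

```scala
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluations = 0
    val lines = LazyList("1950\t22\t1", "1949\t111\t1") // made-up sample lines
    // A "transformation": builds a plan, computes nothing yet.
    val fields = lines.map { line => evaluations += 1; line.split("\t") }
    assert(evaluations == 0)       // the map function has not run
    // An "action": forces the whole pipeline.
    fields.foreach(f => println(f(0)))
    assert(evaluations == 2)       // now every element was processed
  }
}
```

Spark's RDDs behave analogously: transformations such as map() only extend the lineage graph, and actions such as foreach() trigger the actual computation.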
Let's continue with the program. The first transformation we want to perform is to split the lines into fields:

scala> val records = lines.map(_.split("\t"))
records: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at <console>:14
This uses the map() method on RDD to apply a function to every element in the RDD. In this case, we split each line (a String) into a Scala Array of Strings.
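The function passed to map() here is ordinary Scala, so the splitting step can be tried outside Spark. A minimal sketch, using a hypothetical tab-separated line (the field values are illustrative, not taken from the real sample file):

```scala
object SplitDemo {
  def main(args: Array[String]): Unit = {
    val line = "1950\t22\t1" // one hypothetical tab-separated record
    // The same function used in lines.map(_.split("\t")) above:
    val fields: Array[String] = line.split("\t")
    assert(fields.sameElements(Array("1950", "22", "1")))
    println(fields.mkString(",")) // prints 1950,22,1
  }
}
```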