An Example
To introduce Spark, let's run an interactive session using spark-shell, which is a Scala
REPL with a few Spark additions. Start up the shell with the following:
% spark-shell
Spark context available as sc.
scala>
From the console output, we can see that the shell has created a Scala variable, sc, to store
the SparkContext instance. This is our entry point to Spark, and it allows us to load a text
file as follows:
scala> val lines = sc.textFile("input/ncdc/micro-tab/sample.txt")
lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
<console>:12
The lines variable is a reference to a Resilient Distributed Dataset, abbreviated to RDD,
which is the central abstraction in Spark: a read-only collection of objects that is partitioned
across multiple machines in a cluster. In a typical Spark program, one or more RDDs are
loaded as input and, through a series of transformations, are turned into a set of target RDDs,
which have an action performed on them (such as computing a result or writing them to
persistent storage). The term “resilient” in “Resilient Distributed Dataset” refers to the fact
that Spark can automatically reconstruct a lost partition by recomputing it from the RDDs
that it was computed from.
NOTE
Loading an RDD or performing a transformation on one does not trigger any data processing; it merely
creates a plan for performing a computation. The computation is only triggered when an action (such as
foreach()) is performed on an RDD.
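For example, here is a minimal sketch using the lines RDD we just created (the toUpperCase transformation is purely illustrative and not part of the original session):
val upper = lines.map(_.toUpperCase)   // a transformation: still lazy, nothing has run yet
upper.foreach(println)                 // an action: this is what triggers the job and reads the file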
Let's continue with the program. The first transformation we want to perform is to split the
lines into fields:
scala> val records = lines.map(_.split("\t"))
records: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at
<console>:14
This uses the map() method on RDD to apply a function to every element in the RDD. In
this case, we split each line (a String) into a Scala Array of Strings.
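As a quick sanity check (a sketch rather than part of the book's own transcript), you could pull back the first record with the first() action and print its fields:
val firstRecord = records.first()   // first() is an action, so it runs the map() and returns an Array[String]
firstRecord.foreach(println)        // prints each tab-separated field on its own line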