This means that if your Spark program never uses an action operation, it will never trigger
an actual computation, and you will not get any results. For example, the following code
will simply return a new RDD that represents the chain of transformations:
val transformedRDD = rddFromTextFile.map(line => line.size).
  filter(size => size > 10).
  map(size => size * 2)
This returns the following result in the console:
transformedRDD: org.apache.spark.rdd.RDD[Int] =
MappedRDD[8] at map at <console>:14
Notice that no actual computation happens and no result is returned. If we now call an action, such as sum, on the resulting RDD, the computation will be triggered:
val computation = transformedRDD.sum
You will now see that a Spark job is run, and it results in the following console output:
...
14/11/27 21:48:21 INFO SparkContext: Job finished: sum at
<console>:16, took 0.193513 s
computation: Double = 60468.0
Tip
The complete list of transformations and actions possible on RDDs, as well as a set of more detailed examples, is available in the Spark programming guide (located at http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations), and the Scala API documentation is located at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD.
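To give a flavor of what those guides cover, the following is a minimal, self-contained sketch of a few common transformations and actions, run against a small in-memory dataset in local mode. The object name RddOperationsSketch and the sample data are our own; only the Spark calls (parallelize, map, filter, count, collect, reduce) come from the standard RDD API:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A hedged sketch, assuming a local Spark installation: it creates an RDD
// from a small Scala collection and applies a few common operations.
object RddOperationsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDOps").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.parallelize(Seq("spark", "is", "a", "fast", "engine"))

    // Transformations are lazy: nothing is computed on these two lines.
    val lengths = lines.map(line => line.size).filter(size => size > 2)

    // Actions trigger the actual computation and return results to the driver.
    println(lengths.count())          // number of remaining elements
    println(lengths.collect().toSeq)  // all elements as a local collection
    println(lengths.reduce(_ + _))    // sum of the word lengths

    sc.stop()
  }
}
```

As with the earlier examples, the map and filter calls alone build only a lineage of transformations; it is the count, collect, and reduce actions that each launch a Spark job.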
Caching RDDs
One of the most powerful features of Spark is the ability to cache data in memory across a
cluster. This is achieved through use of the cache method on an RDD:
rddFromTextFile.cache
Calling cache on an RDD tells Spark that the RDD should be kept in memory. The first time an action that triggers a computation is called on the RDD, the data is read from its