This means that if your Spark program never uses an action operation, it will never trigger
an actual computation, and you will not get any results. For example, the following code
will simply return a new RDD that represents the chain of transformations:
val transformedRDD = rddFromTextFile.map(line => line.size).
  filter(size => size > 10).
  map(size => size * 2)
This returns the following result in the console:
transformedRDD: org.apache.spark.rdd.RDD[Int] =
MappedRDD[8] at map at <console>:14
Notice that no actual computation happens and no result is returned. If we now call an action, such as sum, on the resulting RDD, the computation will be triggered:
val computation = transformedRDD.sum
You will now see that a Spark job is run, and it results in the following console output:
...
14/11/27 21:48:21 INFO SparkContext: Job finished: sum at
<console>:16, took 0.193513 s
computation: Double = 60468.0
Tip
The complete list of transformations and actions possible on RDDs, as well as a set of more detailed examples, is available in the Spark programming guide (located at http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations), and the Scala API documentation is located at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD.
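To give a flavor of what those guides cover, the following is a minimal, self-contained sketch of a few common transformations and actions, run against a small in-memory dataset in local mode. The object name RddOperationsSketch and the sample data are our own; only the Spark calls (parallelize, map, filter, count, collect, reduce) come from the standard RDD API:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A hedged sketch, assuming a local Spark installation: it creates an RDD
// from a small Scala collection and applies a few common operations.
object RddOperationsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDOps").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.parallelize(Seq("spark", "is", "a", "fast", "engine"))

    // Transformations are lazy: nothing is computed on these two lines.
    val lengths = lines.map(line => line.size).filter(size => size > 2)

    // Actions trigger the actual computation and return results to the driver.
    println(lengths.count())          // number of remaining elements
    println(lengths.collect().toSeq)  // all elements as a local collection
    println(lengths.reduce(_ + _))    // sum of the word lengths

    sc.stop()
  }
}
```

As with the earlier examples, the map and filter calls alone build only a lineage of transformations; it is the count, collect, and reduce actions that each launch a Spark job.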
Caching RDDs
One of the most powerful features of Spark is the ability to cache data in memory across a
cluster. This is achieved through use of the cache method on an RDD:
rddFromTextFile.cache
Calling cache on an RDD tells Spark that the RDD should be kept in memory. The first time an action that triggers a computation is called on the RDD, the data is read from its