Now, we can apply a common action, count, to return the number of records in our RDD:
intsFromStringsRDD.count
The result should look something like the following console output:
14/01/29 23:28:28 INFO SparkContext: Starting job: count at <console>:17
...
14/01/29 23:28:28 INFO SparkContext: Job finished: count at <console>:17, took 0.019227 s
res4: Long = 398
Perhaps we want to find the average line length in this text file. We can first use the sum function to add up the lengths of all the records and then divide that sum by the number of records:
val sumOfRecords = intsFromStringsRDD.sum
val numRecords = intsFromStringsRDD.count
val aveLengthOfRecord = sumOfRecords / numRecords
The result will be as follows:
aveLengthOfRecord: Double = 52.06030150753769
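As an aside, Spark also exposes a mean action on numeric RDDs, which computes the same average in a single step; the following is a minimal sketch, assuming intsFromStringsRDD is the RDD of line lengths from earlier:
val aveLengthDirect = intsFromStringsRDD.mean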
Most Spark operations return a new RDD; the exception is actions, which return the result of a computation (such as Long for count and Double for sum in the preceding example). This means that we can naturally chain operations together to make our program flow more concise and expressive. For example, the same result as the one in the preceding lines of code can be achieved using the following code:
val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count
An important point to note is that Spark transformations are lazy. That is, invoking a transformation on an RDD does not immediately trigger a computation. Instead, transformations are chained together and are effectively computed only when an action is called. This allows Spark to be more efficient by only returning results to the driver when necessary, so that the majority of operations are performed in parallel on the cluster.
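We can see this laziness in action by embedding a side effect in a transformation. The following is a minimal sketch, assuming rddFromTextFile is the RDD created earlier (note that in local mode the println output appears in the console, while on a cluster it goes to the executor logs):
val lazyLengths = rddFromTextFile.map { line =>
  println("processing a line") // not executed when map is defined
  line.size
}
// No Spark job has run yet; map merely records the transformation.
// The println side effect fires only when an action triggers the computation:
val totalLength = lazyLengths.sum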