Now, we can apply a common action, count, to return the number of records in our RDD:
intsFromStringsRDD.count
The result should look something like the following console output:
14/01/29 23:28:28 INFO SparkContext: Starting job: count at <console>:17
...
14/01/29 23:28:28 INFO SparkContext: Job finished: count at <console>:17, took 0.019227 s
res4: Long = 398
Perhaps we want to find the average line length in this text file. We can first use the sum function to add up the lengths of all the records and then divide that sum by the number of records:
val sumOfRecords = intsFromStringsRDD.sum
val numRecords = intsFromStringsRDD.count
val aveLengthOfRecord = sumOfRecords / numRecords
The result will be as follows:
aveLengthOfRecord: Double = 52.06030150753769
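As an aside, Spark also exposes a mean action on numeric RDDs, which computes the same average in a single step; the following is a minimal sketch, assuming intsFromStringsRDD is the RDD of line lengths from earlier:
val aveLengthDirect = intsFromStringsRDD.mean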
Most Spark operations return a new RDD; the exception is actions, which return the result of a computation (such as Long for count and Double for sum in the preceding example). This means that we can naturally chain operations together to make our program flow more concise and expressive. For example, the same result as the one in the preceding lines of code can be achieved using the following code:
val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count
An important point to note is that Spark transformations are lazy. That is, invoking a transformation on an RDD does not immediately trigger a computation. Instead, transformations are chained together and are effectively computed only when an action is called. This allows Spark to be more efficient by only returning results to the driver when necessary, so that the majority of operations are performed in parallel on the cluster.
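We can see this laziness in action by embedding a side effect in a transformation. The following is a minimal sketch, assuming rddFromTextFile is the RDD created earlier (note that in local mode the println output appears in the console, while on a cluster it goes to the executor logs):
val lazyLengths = rddFromTextFile.map { line =>
  println("processing a line") // not executed when map is defined
  line.size
}
// No Spark job has run yet; map merely records the transformation.
// The println side effect fires only when an action triggers the computation:
val totalLength = lazyLengths.sum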