Getting Up and Running with Spark - Machine Learning with Spark


We will often use collect when we wish to apply further processing to our results locally within the driver program.

Note

collect should generally be used only when we really want to return the full result set to the driver and perform further processing there. If we call collect on a very large dataset, we might run out of memory on the driver and crash our program. It is preferable to perform as much heavy-duty processing on our Spark cluster as possible, preventing the driver from becoming a bottleneck. In many cases, however, collecting results to the driver is necessary, such as during iterations in many machine learning models.
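As a rough illustration of the advice above, the following sketch keeps the heavy work on the cluster and brings only small results back to the driver. The dataset, the variable names, and the local[2] master setting are assumptions made for this example, not part of the original text.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode with two threads. getOrCreate reuses an existing
// context if one is already running (for example, inside spark-shell).
val conf = new SparkConf().setAppName("collect-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical dataset: one million integers spread over 8 partitions.
val numbers = sc.parallelize(1 to 1000000, 8)

// The filtering and counting run on the cluster; only one Long is returned.
val evenCount: Long = numbers.filter(_ % 2 == 0).count()

// take(n) returns at most n elements, so driver memory use stays bounded.
val preview: Array[Int] = numbers.take(5)

// collect() is reasonable once the data has been reduced to something small.
val smallResult: Array[Int] = numbers.filter(_ % 250000 == 0).collect()
```

Here count and take never materialize the full dataset on the driver, whereas calling numbers.collect() directly would copy all one million elements into driver memory.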

On inspecting the result, we will see that for each of the three records in our new RDD, we now have a record that is our original broadcasted List, with the new element appended to it (that is, there is now either "1", "2", or "3" at the end):

...

14/01/31 10:15:39 INFO SparkContext: Job finished: collect at <console>:15, took 0.025806 s
res6: Array[List[Any]] = Array(List(a, b, c, d, e, 1), List(a, b, c, d, e, 2), List(a, b, c, d, e, 3))
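The code that produced this output is not shown in this excerpt. A minimal sketch that yields equivalent records might look like the following; the name broadcastAList, the input values, and the local[2] master are assumptions. The List[Any] element type in the output suggests the original expression may have concatenated with ++, which treats a String as a sequence of characters. This sketch uses :+ instead, so the element type stays List[String].

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses a running context if present.
val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Read-only list made available to the worker nodes (hypothetical name).
val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))

// On the workers, append each RDD element to the broadcast list.
// `:+` appends a single element, so each record is a List[String].
val result: Array[List[String]] =
  sc.parallelize(List("1", "2", "3"))
    .map(x => broadcastAList.value :+ x)
    .collect()
```

The broadcast list is read through its value method inside the map function, and collect returns the three resulting lists to the driver.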

An accumulator is also a variable that is broadcast to the worker nodes. The key difference between a broadcast variable and an accumulator is that while the broadcast variable is read-only, the accumulator can be added to. There are limitations to this: in particular, the addition must be an associative operation (current Spark documentation also requires it to be commutative) so that the global accumulated value can be correctly computed in parallel and returned to the driver program. Each worker node can only access and add to its own local accumulator value, and only the driver program can access the global value. Like broadcast variables, accumulators are accessed within Spark code using the value method.
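To make the behavior concrete, here is a small sketch that counts blank lines with an accumulator. It assumes Spark 2.0 or later, where longAccumulator is available (older 1.x releases used sc.accumulator(0) instead). The data, the names, and the local[2] master are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses a running context if present.
val conf = new SparkConf().setAppName("accumulator-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Named accumulator that starts at zero (Spark 2.x+ API).
val blankLines = sc.longAccumulator("blankLines")

// Hypothetical input: five lines, three of them empty, in two partitions.
val lines = sc.parallelize(Seq("spark", "", "mllib", "", ""), 2)

// Each task adds to its own local copy; Spark merges the copies by addition.
// foreach is an action, so the updates are applied when this line runs.
lines.foreach { line =>
  if (line.isEmpty) blankLines.add(1L)
}

// Only the driver reads the merged, global value.
val total: Long = blankLines.value
```

Because integer addition is associative and commutative, the order in which the per-task partial counts are merged does not affect the total seen by the driver.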

Tip

For more details on broadcast variables and accumulators, see the Shared Variables section of the Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html#shared-variables.
