Database Reference
In-Depth Information
We will often use
collect
when we wish to apply further processing to our results loc-
ally within the driver program.
Note
Note that
collect
should generally only be used in cases where we really want to return
the full result set to the driver and perform further processing. If we try to call
collect
on a very large dataset, we might run out of memory on the driver and crash our program.
It is preferable to perform as much heavy-duty processing on our Spark cluster as pos-
sible, preventing the driver from becoming a bottleneck. In many cases, however, collect-
ing results to the driver is necessary, such as during iterations in many machine learning
models.
On inspecting the result, we will see that for each of the three records in our new RDD,
we now have a record that is our original broadcasted
List
, with the new element appen-
ded to it (that is, there is now either
"1"
,
"2"
, or
"3"
at the end):
...
14/01/31 10:15:39 INFO SparkContext: Job finished: collect
at <console>:15, took 0.025806 s
res6: Array[List[Any]] = Array(List(a, b, c, d, e, 1),
List(a, b, c, d, e, 2), List(a, b, c, d, e, 3))
An
accumulator
is also a variable that is broadcasted to the worker nodes. The key differ-
ence between a broadcast variable and an accumulator is that while the broadcast variable
is read-only, the accumulator can be added to. There are limitations to this, that is, in par-
ticular, the addition must be an associative operation so that the global accumulated value
can be correctly computed in parallel and returned to the driver program. Each worker
node can only access and add to its own local accumulator value, and only the driver pro-
gram can access the global value. Accumulators are also accessed within the Spark code
using the
value
method.
Tip
For more details on broadcast variables and accumulators, see the
Shared Variables
sec-
tion of the
Spark Programming Guide
:
http://spark.apache.org/docs/latest/programming-