Database Reference
In-Depth Information
We will often use collect when we wish to apply further processing to our results loc-
ally within the driver program.
Note
Note that collect should generally only be used in cases where we really want to return
the full result set to the driver and perform further processing. If we try to call collect
on a very large dataset, we might run out of memory on the driver and crash our program.
It is preferable to perform as much heavy-duty processing on our Spark cluster as pos-
sible, preventing the driver from becoming a bottleneck. In many cases, however, collect-
ing results to the driver is necessary, such as during iterations in many machine learning
models.
On inspecting the result, we will see that for each of the three records in our new RDD,
we now have a record that is our original broadcasted List , with the new element appen-
ded to it (that is, there is now either "1" , "2" , or "3" at the end):
...
14/01/31 10:15:39 INFO SparkContext: Job finished: collect
at <console>:15, took 0.025806 s
res6: Array[List[Any]] = Array(List(a, b, c, d, e, 1),
List(a, b, c, d, e, 2), List(a, b, c, d, e, 3))
An accumulator is also a variable that is broadcasted to the worker nodes. The key differ-
ence between a broadcast variable and an accumulator is that while the broadcast variable
is read-only, the accumulator can be added to. There are limitations to this, that is, in par-
ticular, the addition must be an associative operation so that the global accumulated value
can be correctly computed in parallel and returned to the driver program. Each worker
node can only access and add to its own local accumulator value, and only the driver pro-
gram can access the global value. Accumulators are also accessed within the Spark code
using the value method.
Tip
For more details on broadcast variables and accumulators, see the Shared Variables sec-
tion of the Spark Programming Guide : http://spark.apache.org/docs/latest/programming-
guide.html#shared-variables .
Search WWH ::




Custom Search