Database Reference
In-Depth Information
Shared Variables
Spark programs often need to access data that is not part of an RDD. For example, this pro-
gram uses a lookup table in a map() operation:
val lookup = Map ( 1 -> "a" , 2 -> "e" , 3 -> "i" , 4 -> "o" , 5 -> "u" )
val result = sc . parallelize ( Array ( 2 , 1 , 3 )). map ( lookup ( _ ))
assert ( result . collect (). toSet === Set ( "a" , "e" , "i" ))
While it works correctly (the variable lookup is serialized as a part of the closure passed
to map() ), there is a more efficient way to achieve the same thing using broadcast vari-
ables .
Broadcast Variables
A broadcast variable is serialized and sent to each executor, where it is cached so that later
tasks can access it if needed. This is unlike a regular variable that is serialized as part of the
closure, which is transmitted over the network once per task. Broadcast variables play a
similar role to the distributed cache in MapReduce (see Distributed Cache ), although the
implementation in Spark stores the data in memory, only spilling to disk when memory is
exhausted.
A broadcast variable is created by passing the variable to be broadcast to the broad-
cast() method on SparkContext . It returns a Broadcast[T] wrapper around the
variable of type T :
val lookup : Broadcast [ Map [ Int , String ]] =
sc . broadcast ( Map ( 1 -> "a" , 2 -> "e" , 3 -> "i" , 4 -> "o" , 5 ->
"u" ) )
val result = sc . parallelize ( Array ( 2 , 1 , 3 )). map ( lookup . value ( _ ))
assert ( result . collect (). toSet === Set ( "a" , "e" , "i" ))
Notice that the variable is accessed in the RDD map() operation by calling value on the
broadcast variable.
As the name suggests, broadcast variables are sent one way, from driver to task — there is
no way to update a broadcast variable and have the update propagate back to the driver. For
that, we need an accumulator.
Accumulators
An accumulator is a shared variable that tasks can only add to, like counters in MapReduce
(see Counters ). After a job has completed, the accumulator's final value can be retrieved
from the driver program. Here is an example that counts the number of elements in an
Search WWH ::




Custom Search