Database Reference
In-Depth Information
Shared Variables
Spark programs often need to access data that is not part of an RDD. For example, this pro-
gram uses a lookup table in a
map()
operation:
val
lookup
=
Map
(
1
->
"a"
,
2
->
"e"
,
3
->
"i"
,
4
->
"o"
,
5
->
"u"
)
val
result
=
sc
.
parallelize
(
Array
(
2
,
1
,
3
)).
map
(
lookup
(
_
))
assert
(
result
.
collect
().
toSet
===
Set
(
"a"
,
"e"
,
"i"
))
While it works correctly (the variable
lookup
is serialized as a part of the closure passed
to
map()
), there is a more efficient way to achieve the same thing using
broadcast vari-
ables
.
Broadcast Variables
A broadcast variable is serialized and sent to each executor, where it is cached so that later
tasks can access it if needed. This is unlike a regular variable that is serialized as part of the
closure, which is transmitted over the network once per task. Broadcast variables play a
similar role to the distributed cache in MapReduce (see
Distributed Cache
), although the
implementation in Spark stores the data in memory, only spilling to disk when memory is
exhausted.
A broadcast variable is created by passing the variable to be broadcast to the
broad-
cast()
method on
SparkContext
. It returns a
Broadcast[T]
wrapper around the
variable of type
T
:
val
lookup
:
Broadcast
[
Map
[
Int
,
String
]]
=
sc
.
broadcast
(
Map
(
1
->
"a"
,
2
->
"e"
,
3
->
"i"
,
4
->
"o"
,
5
->
"u"
)
)
val
result
=
sc
.
parallelize
(
Array
(
2
,
1
,
3
)).
map
(
lookup
.
value
(
_
))
assert
(
result
.
collect
().
toSet
===
Set
(
"a"
,
"e"
,
"i"
))
Notice that the variable is accessed in the RDD
map()
operation by calling
value
on the
broadcast variable.
As the name suggests, broadcast variables are sent one way, from driver to task — there is
no way to update a broadcast variable and have the update propagate back to the driver. For
that, we need an accumulator.
Accumulators
An accumulator is a shared variable that tasks can only add to, like counters in MapReduce
(see
Counters
). After a job has completed, the accumulator's final value can be retrieved
from the driver program. Here is an example that counts the number of elements in an