Broadcast variables and accumulators
Another core feature of Spark is the ability to create two special types of variables: broadcast variables and accumulators.

A broadcast variable is a read-only variable that is made available from the driver program that runs the SparkContext object to the nodes that will execute the computation. This is very useful in applications that need to make the same data available to the worker nodes in an efficient manner, such as machine learning algorithms. Spark makes creating broadcast variables as simple as calling a method on SparkContext as follows:
val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))
The console output shows that the broadcast variable was stored in memory, taking up approximately 488 bytes, and it also shows that we still have approximately 296.9 MB available to us:

14/01/30 07:13:32 INFO MemoryStore: ensureFreeSpace(488) called with curMem=96414, maxMem=311387750
14/01/30 07:13:32 INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 488.0 B, free 296.9 MB)
broadcastAList: org.apache.spark.broadcast.Broadcast[List[String]] = Broadcast(1)
A broadcast variable can be accessed from nodes other than the driver program that created it (that is, the worker nodes) by calling value on the variable:

sc.parallelize(List("1", "2", "3")).map(x => broadcastAList.value :+ x).collect
This code creates a new RDD with three records from a collection (in this case, a Scala List) of ("1", "2", "3"). In the map function, it returns a new collection with the relevant record from our new RDD appended to the broadcastAList that is our broadcast variable.
Notice that we used the collect method in the preceding code. This is a Spark action that returns the entire RDD to the driver as a Scala (or Python or Java) collection.
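To make the result of the map-and-collect pipeline concrete, here is a plain-Python sketch of the same computation. Note that this involves no Spark at all; the variable names are illustrative stand-ins, with broadcast_a_list playing the role of broadcastAList.value as seen on each worker:

```python
# Plain-Python sketch of the broadcast-append example above (no Spark).
# broadcast_a_list stands in for broadcastAList.value on each worker node.
broadcast_a_list = ["a", "b", "c", "d", "e"]

# The records of the parallelized RDD.
records = ["1", "2", "3"]

# map: append each RDD record to the broadcast list;
# collect: gather every resulting collection back to the driver.
collected = [broadcast_a_list + [x] for x in records]

print(collected)
# [['a', 'b', 'c', 'd', 'e', '1'],
#  ['a', 'b', 'c', 'd', 'e', '2'],
#  ['a', 'b', 'c', 'd', 'e', '3']]
```

Each output record is the full broadcast list with one RDD record appended, which is why shipping the list to the workers once, rather than once per record, pays off as the data grows.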