Broadcast variables and accumulators
Another core feature of Spark is the ability to create two special types of variables: broadcast variables and accumulators.
A broadcast variable is a read-only variable that is made available from the driver program (the program that runs the SparkContext object) to the nodes that will execute the computation. This is very useful in applications that need to make the same data available to the worker nodes in an efficient manner, such as machine learning algorithms. Spark makes creating a broadcast variable as simple as calling a method on SparkContext, as follows:
val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))
The console output shows that the broadcast variable was stored in memory, taking up approximately 488 bytes, and that we still have about 296.9 MB of storage memory available:
14/01/30 07:13:32 INFO MemoryStore: ensureFreeSpace(488) called with curMem=96414, maxMem=311387750
14/01/30 07:13:32 INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 488.0 B, free 296.9 MB)
broadcastAList: org.apache.spark.broadcast.Broadcast[List[String]] = Broadcast(1)
A broadcast variable can be accessed from nodes other than the driver program that created
it (that is, the worker nodes) by calling value on the variable:
sc.parallelize(List("1", "2", "3")).map(x => broadcastAList.value ++ x).collect
This code creates a new RDD with three records from a collection (in this case, a Scala List) of ("1", "2", "3"). In the map function, it returns a new collection with the relevant record from our new RDD appended to the list held in broadcastAList, which is our broadcast variable.
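One subtlety worth noting here (a property of Scala's collections, not of Spark itself): because a Scala String is implicitly treated as a sequence of characters, ++ appends the characters of each record individually, and the resulting lists have the type List[Any]. A minimal plain-Scala sketch of what the map function computes for a single record, with no SparkContext required:

```scala
// The value held by the broadcast variable in the example above.
val broadcastedList = List("a", "b", "c", "d", "e")

// What the map function computes for the record "1": a String is
// implicitly a sequence of Char, so ++ appends its characters one by one.
val result = broadcastedList ++ "1"

println(result) // List(a, b, c, d, e, 1) -- the final element is the Char '1'
```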
Notice that we used the collect method in the preceding code. This is a Spark action
that returns the entire RDD to the driver as a Scala (or Python or Java) collection.
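The other special variable type named at the start of this section, the accumulator, is the mirror image of a broadcast variable: worker tasks can only add to it, and only the driver program can read its value. The following is a minimal illustrative sketch, not an example from the text; it assumes a local SparkContext and uses the sc.accumulator method from the Spark 1.x-era API that matches the other examples here:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Create a local SparkContext for illustration; in the spark-shell,
// `sc` is already available and this step is unnecessary.
val conf = new SparkConf().setMaster("local[2]").setAppName("AccumulatorSketch")
val sc = new SparkContext(conf)

// An accumulator starting at 0; worker tasks may only add to it.
val totalChars = sc.accumulator(0)

// Count the total number of characters across all records as a side effect
// of an action (foreach).
sc.parallelize(List("a", "bb", "ccc")).foreach(record => totalChars += record.length)

// Only the driver can read the accumulated value, once the action completes.
println(totalChars.value) // 6

sc.stop()
```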