Broadcast variables and accumulators
Another core feature of Spark is the ability to create two special types of variables: broadcast variables and accumulators.

A broadcast variable is a read-only variable that is made available from the driver program that runs the SparkContext object to the nodes that will execute the computation. This is very useful in applications that need to make the same data available to the worker nodes in an efficient manner, such as machine learning algorithms. Spark makes creating broadcast variables as simple as calling a method on SparkContext as follows:
val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))
The console output shows that the broadcast variable was stored in memory, taking up approximately 488 bytes, and it also shows that we still have approximately 296.9 MB available to us:

14/01/30 07:13:32 INFO MemoryStore: ensureFreeSpace(488) called with curMem=96414, maxMem=311387750
14/01/30 07:13:32 INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 488.0 B, free 296.9 MB)
broadcastAList: org.apache.spark.broadcast.Broadcast[List[String]] = Broadcast(1)
A broadcast variable can be accessed from nodes other than the driver program that created it (that is, the worker nodes) by calling value on the variable:

sc.parallelize(List("1", "2", "3")).map(x => broadcastAList.value :+ x).collect
This code creates a new RDD with three records from a collection (in this case, a Scala List) of ("1", "2", "3"). In the map function, it returns a new collection with the relevant record from our new RDD appended to the broadcastAList that is our broadcast variable.
Notice that we used the collect method in the preceding code. This is a Spark action that returns the entire RDD to the driver as a Scala (or Python or Java) collection.
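To make the result of the map-and-collect pipeline concrete, here is a plain-Python sketch of the same computation. Note that this involves no Spark at all; the variable names are illustrative stand-ins, with broadcast_a_list playing the role of broadcastAList.value as seen on each worker:

```python
# Plain-Python sketch of the broadcast-append example above (no Spark).
# broadcast_a_list stands in for broadcastAList.value on each worker node.
broadcast_a_list = ["a", "b", "c", "d", "e"]

# The records of the parallelized RDD.
records = ["1", "2", "3"]

# map: append each RDD record to the broadcast list;
# collect: gather every resulting collection back to the driver.
collected = [broadcast_a_list + [x] for x in records]

print(collected)
# [['a', 'b', 'c', 'd', 'e', '1'],
#  ['a', 'b', 'c', 'd', 'e', '2'],
#  ['a', 'b', 'c', 'd', 'e', '3']]
```

Each output record is the full broadcast list with one RDD record appended, which is why shipping the list to the workers once, rather than once per record, pays off as the data grows.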