Spark - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

Shared Variables

Spark programs often need to access data that is not part of an RDD. For example, this pro-

gram uses a lookup table in a map() operation:

val lookup = Map ( 1 -> "a" , 2 -> "e" , 3 -> "i" , 4 -> "o" , 5 -> "u" )

val result = sc . parallelize ( Array ( 2 , 1 , 3 )). map ( lookup ( _ ))

assert ( result . collect (). toSet === Set ( "a" , "e" , "i" ))

While it works correctly (the variable lookup is serialized as a part of the closure passed

to map() ), there is a more efficient way to achieve the same thing using broadcast vari-

ables .

Broadcast Variables

A broadcast variable is serialized and sent to each executor, where it is cached so that later

tasks can access it if needed. This is unlike a regular variable that is serialized as part of the

closure, which is transmitted over the network once per task. Broadcast variables play a

similar role to the distributed cache in MapReduce (see Distributed Cache ), although the

implementation in Spark stores the data in memory, only spilling to disk when memory is

exhausted.

A broadcast variable is created by passing the variable to be broadcast to the broad-

cast() method on SparkContext . It returns a Broadcast[T] wrapper around the

variable of type T :

val lookup : Broadcast [ Map [ Int , String ]] =

sc . broadcast ( Map ( 1 -> "a" , 2 -> "e" , 3 -> "i" , 4 -> "o" , 5 ->

"u" ) )

val result = sc . parallelize ( Array ( 2 , 1 , 3 )). map ( lookup . value ( _ ))

assert ( result . collect (). toSet === Set ( "a" , "e" , "i" ))

Notice that the variable is accessed in the RDD map() operation by calling value on the

broadcast variable.

As the name suggests, broadcast variables are sent one way, from driver to task — there is

no way to update a broadcast variable and have the update propagate back to the driver. For

that, we need an accumulator.

Accumulators

An accumulator is a shared variable that tasks can only add to, like counters in MapReduce

(see Counters ). After a job has completed, the accumulator's final value can be retrieved

from the driver program. Here is an example that counts the number of elements in an

Search WWH ::

Custom Search

Home