Database Reference
In-Depth Information
When our task involves a large setup time, such as creating a database connection or
random-number generator, it is useful to share this setup work across multiple data
items. Using a remote call sign lookup database, we examine how to reuse setup work
by operating on a per-partition basis.
In addition to the languages directly supported by Spark, the system can call into pro‐
grams written in other languages. This chapter introduces how to use Spark's
language-agnostic pipe() method to interact with other programs through standard
input and output. We will use the pipe() method to access an R library for comput‐
ing the distance of a ham radio operator's contacts.
Finally, similar to its tools for working with key/value pairs, Spark has methods for
working with numeric data. We demonstrate these methods by removing outliers
from the distances computed with our ham radio call logs.
Accumulators
When we normally pass functions to Spark, such as a map() function or a condition
for filter() , they can use variables defined outside them in the driver program, but
each task running on the cluster gets a new copy of each variable, and updates from
these copies are not propagated back to the driver. Spark's shared variables, accumu‐
lators and broadcast variables , relax this restriction for two common types of com‐
munication patterns: aggregation of results and broadcasts.
Our first type of shared variable, accumulators, provides a simple syntax for aggregat‐
ing values from worker nodes back to the driver program. One of the most common
uses of accumulators is to count events that occur during job execution for debugging
purposes. For example, say that we are loading a list of all of the call signs for which
we want to retrieve logs from a file, but we are also interested in how many lines of
the input file were blank (perhaps we do not expect to see many such lines in valid
input). Examples 6-2 through 6-4 demonstrate this scenario.
Example 6-2. Accumulator empty line count in Python
file = sc . textFile ( inputFile )
# Create Accumulator[Int] initialized to 0
blankLines = sc . accumulator ( 0 )
def extractCallSigns ( line ):
global blankLines # Make the global variable accessible
if ( line == "" ):
blankLines += 1
return line . split ( " " )
callSigns = file . flatMap ( extractCallSigns )
Search WWH ::




Custom Search