CHAPTER 6
Advanced Spark Programming
Introduction
This chapter introduces a variety of advanced Spark programming features that we
didn't get to cover in the previous chapters. We introduce two types of shared variables:
accumulators to aggregate information and broadcast variables to efficiently
distribute large values. Building on our existing transformations on RDDs, we introduce
batch operations for tasks with high setup costs, like querying a database. To
expand the range of tools accessible to us, we cover Spark's methods for interacting
with external programs, such as scripts written in R.
Throughout this chapter we build an example using ham radio operators' call logs as
the input. These logs, at a minimum, include the call signs of the stations contacted.
Call signs are assigned by country, and each country has its own range of call signs, so
we can look up the countries involved. Some call logs also include the physical location
of the operators, which we can use to determine the distance involved. We
include a sample log entry in Example 6-1. The accompanying sample repo includes a list of
call signs for which to look up the call logs and process the results.
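To make the country lookup concrete, here is a minimal sketch in plain Python (no Spark involved). It is not the chapter's implementation: it swaps the range-based lookup for a simpler longest-prefix match. The table `PREFIX_TO_COUNTRY`, the function `lookup_country`, and every call sign other than KK6JLK are hypothetical names and values chosen for illustration; a real prefix table is far larger.

```python
# Hypothetical, heavily abbreviated prefix table (an assumption for this sketch).
PREFIX_TO_COUNTRY = {
    "K": "United States",
    "W": "United States",
    "N": "United States",
    "VE": "Canada",
    "JA": "Japan",
}


def lookup_country(call_sign, table=PREFIX_TO_COUNTRY):
    """Return the country for the longest matching prefix, or None if unknown."""
    sign = call_sign.strip().upper()
    # Try the whole call sign first, then progressively shorter prefixes.
    for end in range(len(sign), 0, -1):
        country = table.get(sign[:end])
        if country is not None:
            return country
    return None


if __name__ == "__main__":
    # Only KK6JLK comes from the sample entry; the other call signs are made up.
    for sign in ["KK6JLK", "VE3XYZ", "JA1ABC", "ZZ9ZZZ"]:
        print(sign, "->", lookup_country(sign))
```

Trying the longest prefix first means a more specific entry wins over a shorter one if both are present in the table.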
Example 6-1. Sample call log entry in JSON, with some fields removed
{"address": "address here", "band": "40m", "callsign": "KK6JLK", "city": "SUNNYVALE",
"contactlat": "37.384733", "contactlong": "-122.032164",
"county": "Santa Clara", "dxcc": "291", "fullname": "MATTHEW McPherrin",
"id": 57779, "mode": "FM", "mylat": "37.751952821", "mylong": "-122.4208688735", ...}
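As a rough illustration of how the location fields could yield a distance, the following sketch parses a trimmed copy of the sample entry and applies the haversine great-circle formula. This is an assumption about one reasonable approach, not necessarily how the chapter computes distance. The names `SAMPLE`, `haversine_km`, and `contact_distance_km` are hypothetical, and the elided fields are left out rather than guessed.

```python
import json
import math

# Fields copied from the sample entry; the elided fields are simply omitted.
SAMPLE = (
    '{"callsign": "KK6JLK", '
    '"contactlat": "37.384733", "contactlong": "-122.032164", '
    '"mylat": "37.751952821", "mylong": "-122.4208688735"}'
)

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def contact_distance_km(entry):
    """Distance between the logging operator and the contacted station."""
    # The log stores coordinates as strings, so convert them before use.
    return haversine_km(
        float(entry["mylat"]), float(entry["mylong"]),
        float(entry["contactlat"]), float(entry["contactlong"]),
    )


if __name__ == "__main__":
    entry = json.loads(SAMPLE)
    print(entry["callsign"], round(contact_distance_km(entry), 1), "km")
```

Entries without location fields would raise a `KeyError` here, so a real pipeline would need to skip or count them.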
The first set of Spark features we'll look at are shared variables, which are a special
type of variable you can use in Spark tasks. In our example we use Spark's shared
variables to count nonfatal error conditions and distribute a large lookup table.
 