Advanced Spark Programming - Learning Spark

CHAPTER 6

Advanced Spark Programming

Introduction

This chapter introduces a variety of advanced Spark programming features that we
didn't get to cover in the previous chapters. We introduce two types of shared
variables: accumulators to aggregate information and broadcast variables to
efficiently distribute large values. Building on our existing transformations on
RDDs, we introduce batch operations for tasks with high setup costs, like querying
a database. To expand the range of tools accessible to us, we cover Spark's methods
for interacting with external programs, such as scripts written in R.

Throughout this chapter we build an example using ham radio operators' call logs as
the input. These logs, at a minimum, include the call signs of the stations
contacted. Call signs are assigned by country, and each country has its own range of
call signs, so we can look up the countries involved. Some call logs also include the
physical location of the operators, which we can use to determine the distance
involved. We include a sample log entry in Example 6-1. The book's sample repo
includes a list of call signs for which to look up the call logs and process the
results.

Example 6-1. Sample call log entry in JSON, with some fields removed

{"address": "address here", "band": "40m", "callsign": "KK6JLK", "city": "SUNNYVALE",
 "contactlat": "37.384733", "contactlong": "-122.032164",
 "county": "Santa Clara", "dxcc": "291", "fullname": "MATTHEW McPherrin",
 "id": 57779, "mode": "FM", "mylat": "37.751952821", "mylong": "-122.4208688735", ...}

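To make the "distance involved" concrete, here is a minimal plain-Python sketch (no Spark required) that parses the coordinate fields from the sample entry and computes a great-circle distance with the haversine formula. The helper names `haversine_km` and `contact_distance_km`, the Earth radius constant, and the reading of `mylat`/`mylong` as the logging operator and `contactlat`/`contactlong` as the contacted station are assumptions made for illustration, not something the sample entry itself confirms.

```python
import json
import math

# Coordinate fields copied from the sample entry in Example 6-1. The fields
# elided there ("...") are left out rather than guessed.
SAMPLE_ENTRY = """
{"callsign": "KK6JLK", "contactlat": "37.384733", "contactlong": "-122.032164",
 "mylat": "37.751952821", "mylong": "-122.4208688735"}
"""

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def contact_distance_km(entry):
    """Distance between the "my*" and "contact*" coordinates of one log entry.

    The log stores coordinates as strings, so they are converted to floats first.
    """
    return haversine_km(
        float(entry["mylat"]), float(entry["mylong"]),
        float(entry["contactlat"]), float(entry["contactlong"]),
    )


entry = json.loads(SAMPLE_ENTRY)
distance = contact_distance_km(entry)
print(f"{entry['callsign']}: {distance:.1f} km")
```

In a Spark job, a function like `contact_distance_km` could be applied to each parsed log entry with `map()`. Entries that lack location fields would need to be filtered out or handled first, since the lookups above raise `KeyError` on missing keys.
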
The first set of Spark features we'll look at are shared variables, which are a
special type of variable you can use in Spark tasks. In our example we use Spark's
shared variables to count nonfatal error conditions and distribute a large lookup
table.
