CHAPTER 4
Working with Key/Value Pairs
This chapter covers how to work with RDDs of key/value pairs, which are a common
data type required for many operations in Spark. Key/value RDDs are commonly
used to perform aggregations, and often we will do some initial ETL (extract,
transform, and load) to get our data into a key/value format. Key/value RDDs expose new
operations (e.g., counting up reviews for each product, grouping together data with
the same key, and grouping together two different RDDs).
We also discuss an advanced feature that lets users control the layout of pair RDDs
across nodes: partitioning. Using controllable partitioning, applications can
sometimes greatly reduce communication costs by ensuring that data that will be
accessed together is stored on the same node. This can provide significant speedups. We
illustrate partitioning using the PageRank algorithm as an example. Choosing the
right partitioning for a distributed dataset is similar to choosing the right data
structure for a local one: in both cases, data layout can greatly affect performance.
Motivation
Spark provides special operations on RDDs containing key/value pairs. These RDDs
are called pair RDDs. Pair RDDs are a useful building block in many programs, as
they expose operations that allow you to act on each key in parallel or regroup data
across the network. For example, pair RDDs have a reduceByKey() method that can
aggregate data separately for each key, and a join() method that can merge two
RDDs together by grouping elements with the same key. It is common to extract
fields from an RDD (representing, for instance, an event time, customer ID, or other
identifier) and use those fields as keys in pair RDD operations.
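To make the semantics concrete, here is a minimal plain-Python sketch of what reduceByKey() and join() compute. It is not Spark code. It emulates the per-key behavior on local lists, and the helper names reduce_by_key and join_by_key, along with the sample records, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def reduce_by_key(pairs, func):
    """Combine all values that share a key using func (local emulation)."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

def join_by_key(left, right):
    """Inner join: emit (key, (left_value, right_value)) for matching keys."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]

# Hypothetical input records; the customer ID is the field we extract as the key.
purchases = [
    {"customer_id": "c1", "amount": 20},
    {"customer_id": "c2", "amount": 5},
    {"customer_id": "c1", "amount": 15},
]
pairs = [(p["customer_id"], p["amount"]) for p in purchases]

# Aggregate separately for each key.
totals = reduce_by_key(pairs, lambda a, b: a + b)

# Merge two keyed datasets; keys present on only one side are dropped.
names = [("c1", "Ada"), ("c3", "Lin")]
joined = join_by_key(sorted(totals.items()), names)
```

In Spark itself the same steps would be expressed on pair RDDs with reduceByKey() and join(), and the work would be distributed across nodes instead of running in a single local loop.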
 