Working with Key/Value Pairs - Learning Spark


CHAPTER 4

Working with Key/Value Pairs

This chapter covers how to work with RDDs of key/value pairs, which are a common

data type required for many operations in Spark. Key/value RDDs are commonly

used to perform aggregations, and often we will do some initial ETL (extract,

transform, and load) to get our data into a key/value format. Key/value RDDs expose new

operations (e.g., counting up reviews for each product, grouping together data with

the same key, and grouping together two different RDDs).

We also discuss an advanced feature that lets users control the layout of pair RDDs

across nodes: partitioning. Using controllable partitioning, applications can

sometimes greatly reduce communication costs by ensuring that data that is accessed

together is placed on the same node. This can provide significant speedups. We

illustrate partitioning using the PageRank algorithm as an example. Choosing the

right partitioning for a distributed dataset is similar to choosing the right data

structure for a local one; in both cases, data layout can greatly affect performance.
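
To make the co-location idea concrete, here is a minimal plain-Python sketch, not Spark itself. It assumes a simple hash partitioner (CRC32 of the key modulo a partition count); the names `partition_of`, `partition_pairs`, and `NUM_PARTITIONS`, and the small PageRank-style `links` and `ranks` datasets, are all hypothetical. In Spark the analogous step would be calling `partitionBy()` on a pair RDD.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count for this sketch


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Map a string key to a partition index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_pairs(pairs, num_partitions=NUM_PARTITIONS):
    """Split (key, value) pairs into per-partition lists."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical PageRank-style inputs keyed by page ID.
links = [("a", ["b", "c"]), ("b", ["a"]), ("c", ["a"])]
ranks = [("a", 1.0), ("b", 1.0), ("c", 1.0)]

links_parts = partition_pairs(links)
ranks_parts = partition_pairs(ranks)

# Because both datasets use the same partition function, each key lands at
# the same partition index in both, so matching records could be combined
# partition by partition without moving data between partitions.
co_located = all(
    {k for k, _ in lp} == {k for k, _ in rp}
    for lp, rp in zip(links_parts, ranks_parts)
)
```

If the two datasets were partitioned with different functions, matching keys could end up at different indices, and combining them would require moving records between partitions first.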

Motivation

Spark provides special operations on RDDs containing key/value pairs. These RDDs

are called pair RDDs. Pair RDDs are a useful building block in many programs, as

they expose operations that allow you to act on each key in parallel or regroup data

across the network. For example, pair RDDs have a reduceByKey() method that can

aggregate data separately for each key, and a join() method that can merge two

RDDs together by grouping elements with the same key. It is common to extract

fields from an RDD (representing, for instance, an event time, customer ID, or other

identifier) and use those fields as keys in pair RDD operations.
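
The following plain-Python sketch mimics what these two operations compute, using small in-memory lists rather than a Spark cluster. The helpers `reduce_by_key` and `join_pairs`, the `customer_id,amount` record format, and the sample data are assumptions made for illustration; in PySpark the corresponding calls would be `rdd.reduceByKey(func)` and `rdd.join(other)` on pair RDDs.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Combine all values sharing a key with func (local stand-in for reduceByKey)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


def join_pairs(left, right):
    """Inner join: emit (key, (left_value, right_value)) for each matching key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (lv, rv))
        for key, lv in left
        for rv in right_by_key.get(key, [])
    ]


# Hypothetical input records in the form "customer_id,amount".
records = ["c1,10", "c2,5", "c1,7"]

# Extract a field from each record to use as the key.
pairs = [(line.split(",")[0], int(line.split(",")[1])) for line in records]

# Aggregate the amounts separately for each customer ID.
totals = reduce_by_key(pairs, lambda a, b: a + b)

# Merge the per-customer totals with a second keyed dataset.
names = [("c1", "Ada"), ("c2", "Grace")]
joined = join_pairs(sorted(totals.items()), names)
```

Here `totals` maps each customer ID to the sum of its amounts, and `joined` pairs each total with the name stored under the same key. Spark performs the same logical steps, but distributes the pairs across nodes and may regroup them over the network.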
