Working with Key/Value Pairs - Learning Spark

pair RDD in Java from an in-memory collection, we instead use SparkContext.parallelizePairs().

Transformations on Pair RDDs

Pair RDDs can use all the transformations available to standard RDDs, and the rules from “Passing Functions to Spark” on page 30 still apply. Since pair RDDs contain tuples, we need to pass functions that operate on tuples rather than on individual elements. Tables 4-1 and 4-2 summarize transformations on pair RDDs, and we will dive into the transformations in detail later in the chapter.
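To make the point about tuple-based functions concrete, here is a minimal sketch of a standard RDD transformation (filter) applied to a pair RDD. The object name, the helper valuesBelow, the local master setting, and the sample data are assumptions chosen for illustration, and the code assumes Spark 1.3 or later on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairFilterSketch {
  // Hypothetical helper: keeps only the pairs whose value is below `limit`.
  // The predicate receives the whole (key, value) tuple, not a single element.
  def valuesBelow(pairs: RDD[(Int, Int)], limit: Int): RDD[(Int, Int)] =
    pairs.filter { case (_, value) => value < limit }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setMaster("local[*]").setAppName("PairFilterSketch")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))
      // Drops (3, 6) because its value is not below 5.
      valuesBelow(pairs, 5).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The pattern match `case (_, value)` is one way to unpack the tuple; accessing `pair._2` directly would work as well.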

Table 4-1. Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)})

| Function name | Purpose | Example | Result |
| --- | --- | --- | --- |
| reduceByKey(func) | Combine values with the same key. | rdd.reduceByKey((x, y) => x + y) | {(1, 2), (3, 10)} |
| groupByKey() | Group values with the same key. | rdd.groupByKey() | {(1, [2]), (3, [4, 6])} |
| combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner) | Combine values with the same key using a different result type. | See Examples 4-12 through 4-14. | |
| mapValues(func) | Apply a function to each value of a pair RDD without changing the key. | rdd.mapValues(x => x+1) | {(1, 3), (3, 5), (3, 7)} |
| flatMapValues(func) | Apply a function that returns an iterator to each value of a pair RDD, and for each element returned, produce a key/value entry with the old key. Often used for tokenization. | rdd.flatMapValues(x => (x to 5)) | {(1, 2), (1, 3), (1, 4), (1, 5), (3, 4), (3, 5)} |
| keys() | Return an RDD of just the keys. | rdd.keys() | {1, 3, 3} |
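The transformations in Table 4-1 can be tried together on the table's example data. The following is a sketch under stated assumptions, not the book's own listing: the object name, the averageByKey helper, and the local master setting are hypothetical, and Spark 1.3 or later is assumed so that pair RDD functions are available without extra implicit imports. The per-key average is only one possible use of combineByKey, chosen because its accumulator type differs from the value type.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairTransformationsSketch {
  // Hypothetical helper: per-key average via combineByKey.
  // The accumulator is a (sum, count) pair, a different type from the Int values.
  def averageByKey(rdd: RDD[(Int, Int)]): RDD[(Int, Double)] =
    rdd.combineByKey(
      (v: Int) => (v, 1),                                           // createCombiner
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners
    ).mapValues { case (sum, count) => sum.toDouble / count }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setMaster("local[*]").setAppName("PairTransformationsSketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))

      val summed      = rdd.reduceByKey((x, y) => x + y) // sums the values of each key
      val grouped     = rdd.groupByKey()                 // one Iterable of values per key
      val incremented = rdd.mapValues(x => x + 1)        // keys are left unchanged
      val expanded    = rdd.flatMapValues(x => x to 5)   // 6 to 5 is empty, so (3, 6) yields nothing
      val keysOnly    = rdd.keys                         // the Scala API defines keys without parentheses

      // Element order after a shuffle (reduceByKey, groupByKey, combineByKey) may vary.
      println(summed.collect().toList)
      println(grouped.mapValues(_.toList).collect().toList)
      println(incremented.collect().toList)
      println(expanded.collect().toList)
      println(keysOnly.collect().toList)
      println(averageByKey(rdd).collect().toList)
    } finally {
      sc.stop()
    }
  }
}
```

For the sample data, averageByKey pairs key 1 with 2.0 and key 3 with 5.0. Carrying a (sum, count) accumulator lets partial results from different partitions be merged correctly before the final division.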
