pair RDD in Java from an in-memory collection, we instead use SparkContext.parallelizePairs().
Transformations on Pair RDDs
Pair RDDs can use all the transformations available to standard RDDs, and the
same rules from “Passing Functions to Spark” on page 30 apply. Since pair RDDs
contain tuples, we need to pass functions that operate on tuples rather than on
individual elements. Tables 4-1 and 4-2 summarize transformations on pair RDDs, and
we will dive into the transformations in detail later in the chapter.
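To make the point about tuple-taking functions concrete, here is a minimal sketch in Scala. It assumes Spark 1.3 or later is on the classpath and runs in local mode; the object name `PairFilterSketch`, the helper `valuesBelow`, and the threshold of 5 are illustrative choices, not names from the chapter.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairFilterSketch {
  // filter() is a standard RDD transformation; on a pair RDD the function
  // receives the whole (key, value) tuple, which we take apart by pattern matching.
  def valuesBelow(pairs: RDD[(Int, Int)], limit: Int): RDD[(Int, Int)] =
    pairs.filter { case (_, value) => value < limit }

  def main(args: Array[String]): Unit = {
    // Local-mode context, assumed here only so the sketch can run on its own.
    val conf = new SparkConf().setMaster("local[*]").setAppName("PairFilterSketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))
      // Keeps the pairs whose value is below 5.
      valuesBelow(rdd, 5).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With the sample data used in this section, the pairs (1, 2) and (3, 4) pass the filter, and (3, 6) is dropped.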
Table 4-1. Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)})
| Function name | Purpose | Example | Result |
|---|---|---|---|
| reduceByKey(func) | Combine values with the same key. | rdd.reduceByKey((x, y) => x + y) | {(1, 2), (3, 10)} |
| groupByKey() | Group values with the same key. | rdd.groupByKey() | {(1, [2]), (3, [4, 6])} |
| combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner) | Combine values with the same key using a different result type. | See Examples 4-12 through 4-14. | |
| mapValues(func) | Apply a function to each value of a pair RDD without changing the key. | rdd.mapValues(x => x+1) | {(1, 3), (3, 5), (3, 7)} |
| flatMapValues(func) | Apply a function that returns an iterator to each value of a pair RDD, and for each element returned, produce a key/value entry with the old key. Often used for tokenization. | rdd.flatMapValues(x => (x to 5)) | {(1, 2), (1, 3), (1, 4), (1, 5), (3, 4), (3, 5)} |
| keys() | Return an RDD of just the keys. | rdd.keys() | {1, 3, 3} |
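The transformations in Table 4-1 can be tried out with a short Scala sketch like the one below. It assumes Spark 1.3 or later in local mode; the object name `PairTransformationsSketch` and the helper `sumAndCount` are hypothetical. The `combineByKey` call is one possible illustration of "a different result type" (a per-key sum and count) and is not the chapter's Examples 4-12 through 4-14. Note that in the Scala API `keys` is written without parentheses.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairTransformationsSketch {
  // Per-key (sum, count): the combined type (Int, Int) differs from the Int values.
  def sumAndCount(pairs: RDD[(Int, Int)]): RDD[(Int, (Int, Int))] =
    pairs.combineByKey(
      // createCombiner: first value seen for a key within a partition
      (v: Int) => (v, 1),
      // mergeValue: fold another value into an existing combiner
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
      // mergeCombiners: merge combiners built on different partitions
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
    )

  def main(args: Array[String]): Unit = {
    // Local-mode context, assumed only so the sketch is self-contained.
    val conf = new SparkConf().setMaster("local[*]").setAppName("PairTransformationsSketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))

      println(rdd.reduceByKey((x, y) => x + y).collect().toList)      // sum per key
      println(rdd.groupByKey().mapValues(_.toList).collect().toList)  // values grouped per key
      println(rdd.mapValues(x => x + 1).collect().toList)             // keys unchanged
      println(rdd.flatMapValues(x => x to 5).collect().toList)        // one entry per element of the range
      println(rdd.keys.collect().toList)                              // keys only, duplicates kept
      println(sumAndCount(rdd).collect().toList)                      // (sum, count) per key
    } finally {
      sc.stop()
    }
  }
}
```

The first five results should agree with the Result column of Table 4-1, although the order of keys after a shuffle is not guaranteed. For the sample data, `sumAndCount` yields (2, 1) for key 1 and (10, 2) for key 3.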