Database Reference
In-Depth Information
Creating Pair RDDs
There are a number of ways to get pair RDDs in Spark. Many formats we explore
loading from in
Chapter 5
will directly return pair RDDs for their key/value data. In
other cases we have a regular RDD that we want to turn into a pair RDD. We can do
this by running a
map()
function that returns key/value pairs. To illustrate, we show
code that starts with an RDD of lines of text and keys the data by the first word in
each line.
The way to build key-value RDDs differs by language. In Python, for the functions on
keyed data to work we need to return an RDD composed of tuples (see
Example 4-1
).
Example 4-1. Creating a pair RDD using the first word as the key in Python
pairs
=
lines
.
map
(
lambda
x
:
(
x
.
split
(
" "
)[
0
],
x
))
In Scala, for the functions on keyed data to be available, we also need to return tuples
(see
Example 4-2
). An implicit conversion on RDDs of tuples exists to provide the
additional key/value functions.
Example 4-2. Creating a pair RDD using the first word as the key in Scala
val
pairs
=
lines
.
map
(
x
=>
(
x
.
split
(
" "
)(
0
),
x
))
Java doesn't have a built-in tuple type, so Spark's Java API has users create tuples
using the
scala.Tuple2
class. This class is very simple: Java users can construct a
new tuple by writing
new Tuple2(elem1, elem2)
and can then access its elements
with the
._1()
and
._2()
methods.
Java users also need to call special versions of Spark's functions when creating pair
RDDs. For instance, the
mapToPair()
function should be used in place of the basic
map()
function. This is discussed in more detail in
“Java” on page 43
, but let's look at
a simple case in
Example 4-3
.
Example 4-3. Creating a pair RDD using the first word as the key in Java
PairFunction
<
String
,
String
,
String
>
keyData
=
new
PairFunction
<
String
,
String
,
String
>()
{
public
Tuple2
<
String
,
String
>
call
(
String
x
)
{
return
new
Tuple2
(
x
.
split
(
" "
)[
0
],
x
);
}
};
JavaPairRDD
<
String
,
String
>
pairs
=
lines
.
mapToPair
(
keyData
);
When creating a pair RDD from an in-memory collection in Scala and Python, we
only need to call
SparkContext.parallelize()
on a collection of pairs. To create a