Function name | Purpose | Example | Result
------------- | ------- | ------- | ------
values() | Return an RDD of just the values. | rdd.values() | {2, 4, 6}
sortByKey() | Return an RDD sorted by the key. | rdd.sortByKey() | {(1, 2), (3, 4), (3, 6)}
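To make the values() and sortByKey() entries above concrete, here is a minimal plain-Python sketch of their semantics. It is a model on ordinary lists, not the Spark API: the helper names values and sort_by_key are hypothetical, and the input is deliberately unordered so that the sort is visible.

```python
# Plain-Python model of two single-RDD transformations (not the Spark API).
rdd = [(3, 4), (1, 2), (3, 6)]  # same pairs as the table, deliberately unordered


def values(pairs):
    """Keep only the value of each (key, value) pair."""
    return [value for _, value in pairs]


def sort_by_key(pairs):
    """Order pairs by key only; values play no part in the comparison."""
    return sorted(pairs, key=lambda kv: kv[0])


print(values(rdd))       # [4, 2, 6]
print(sort_by_key(rdd))  # [(1, 2), (3, 4), (3, 6)]
```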
Table 4-2. Transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)})
Function name | Purpose | Example | Result
------------- | ------- | ------- | ------
subtractByKey | Remove elements with a key present in the other RDD. | rdd.subtractByKey(other) | {(1, 2)}
join | Perform an inner join between two RDDs. | rdd.join(other) | {(3, (4, 9)), (3, (6, 9))}
rightOuterJoin | Perform a join between two RDDs where the key must be present in the other RDD. | rdd.rightOuterJoin(other) | {(3, (Some(4), 9)), (3, (Some(6), 9))}
leftOuterJoin | Perform a join between two RDDs where the key must be present in the first RDD. | rdd.leftOuterJoin(other) | {(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))}
cogroup | Group data from both RDDs sharing the same key. | rdd.cogroup(other) | {(1, ([2], [])), (3, ([4, 6], [9]))}
We discuss each of these families of pair RDD functions in more detail in the
upcoming sections.
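As a rough illustration of what the two-RDD transformations in Table 4-2 compute, the following sketch models them on plain Python lists. It is not the Spark API: every helper name here (group, subtract_by_key, join, left_outer_join, right_outer_join, cogroup) is hypothetical, nothing is distributed, and the model uses None for a missing side and leaves present values unwrapped, where the Scala results in the table use Some and None.

```python
from collections import defaultdict

# Same sample data as the caption of Table 4-2.
rdd = [(1, 2), (3, 4), (3, 6)]
other = [(3, 9)]


def group(pairs):
    """Map each key to the list of its values, in input order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def subtract_by_key(left, right):
    """Drop pairs from left whose key also appears in right."""
    right_keys = {key for key, _ in right}
    return [(k, v) for k, v in left if k not in right_keys]


def join(left, right):
    """Inner join: one output pair per matching (left value, right value)."""
    right_groups = group(right)
    return [(k, (v, w)) for k, v in left for w in right_groups.get(k, [])]


def left_outer_join(left, right):
    """Keep every key of left; a key missing from right pairs with None."""
    right_groups = group(right)
    return [(k, (v, w)) for k, v in left for w in right_groups.get(k, [None])]


def right_outer_join(left, right):
    """Keep every key of right; a key missing from left pairs with None."""
    left_groups = group(left)
    return [(k, (v, w)) for k, w in right for v in left_groups.get(k, [None])]


def cogroup(left, right):
    """For each key in either input, collect the values from both sides."""
    lg, rg = group(left), group(right)
    return [(k, (lg.get(k, []), rg.get(k, []))) for k in sorted(set(lg) | set(rg))]


print(subtract_by_key(rdd, other))   # [(1, 2)]
print(join(rdd, other))              # [(3, (4, 9)), (3, (6, 9))]
print(left_outer_join(rdd, other))   # [(1, (2, None)), (3, (4, 9)), (3, (6, 9))]
print(right_outer_join(rdd, other))  # [(3, (4, 9)), (3, (6, 9))]
print(cogroup(rdd, other))           # [(1, ([2], [])), (3, ([4, 6], [9]))]
```

Note how key 1 survives only in the left outer join and in cogroup: it exists in rdd but not in other, so any operation that requires the key on the other side drops it.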
Pair RDDs are still RDDs (of Tuple2 objects in Java/Scala or of Python tuples),
and thus support the same functions as RDDs. For instance, we can take our pair
RDD from the previous section and filter out lines of 20 or more characters,
keeping only the pairs whose value is shorter than that, as shown in
Examples 4-4 through 4-6 and Figure 4-1.
Example 4-4. Simple filter on second element in Python
result = pairs.filter(lambda keyValue: len(keyValue[1]) < 20)
Example 4-5. Simple filter on second element in Scala
pairs.filter { case (key, value) => value.length < 20 }
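To see the effect of this filter without a Spark cluster, here is a small plain-Python sketch that applies the same predicate to an ordinary list. The sample lines are made up for illustration, and building pairs by keying each line on its first word is an assumption about how the pair RDD was created earlier; Python's built-in filter stands in for the RDD method.

```python
# Hypothetical sample input; each line becomes a (first word, whole line) pair.
lines = ["holden likes coffee", "panda likes long strings and coffee", "spark"]
pairs = [(line.split(" ")[0], line) for line in lines]

# Same predicate as Example 4-4: keep pairs whose value has fewer than 20 characters.
result = list(filter(lambda keyValue: len(keyValue[1]) < 20, pairs))

print(result)  # [('holden', 'holden likes coffee'), ('spark', 'spark')]
```

The predicate inspects only the second element of each tuple, so the keys pass through unchanged and the 35-character line is dropped.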
 