Sometimes we don't need the key to be present in both RDDs to want it in our result.
For example, if we were joining customer information with recommendations we
might not want to drop customers if there were not any recommendations yet.
leftOuterJoin(other) and rightOuterJoin(other) both join pair RDDs together by
key, where one of the pair RDDs can be missing the key.
With leftOuterJoin() the resulting pair RDD has entries for each key in the source
RDD. The value associated with each key in the result is a tuple of the value from the
source RDD and an Option (or Optional in Java) for the value from the other pair
RDD. In Python, if a value isn't present, None is used; if the value is present, the
regular value, without any wrapper, is used. As with join(), we can have multiple
entries for each key; when this occurs, we get the Cartesian product between the two
lists of values.
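The semantics described above can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: left_outer_join is a hypothetical helper working on ordinary lists of (key, value) pairs, the customer and recommendation data are made up, and the output order follows the input lists, whereas a real RDD makes no such ordering promise.

```python
from collections import defaultdict

def left_outer_join(source, other):
    """Hypothetical helper: mimic leftOuterJoin() on lists of (key, value) pairs."""
    # Group the values of the other collection by key.
    other_by_key = defaultdict(list)
    for key, value in other:
        other_by_key[key].append(value)

    result = []
    for key, value in source:
        matches = other_by_key.get(key)
        if matches:
            # Several matches for one key yield one output entry per pairing,
            # which is the Cartesian product mentioned in the text.
            result.extend((key, (value, match)) for match in matches)
        else:
            # Key missing on the other side: keep the entry, use None (Python style).
            result.append((key, (value, None)))
    return result

# Made-up data: bob has no recommendations yet.
customers = [("alice", "Alice A."), ("bob", "Bob B.")]
recommendations = [("alice", "book"), ("alice", "lamp")]

joined = left_outer_join(customers, recommendations)
for entry in joined:
    print(entry)
```

Every customer survives the join: alice appears once per recommendation, and bob is kept with None in place of a recommendation.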
Optional is part of Google's Guava library and represents a possibly missing value.
We can check isPresent() to see if it's set, and get() will return the contained
instance provided data is present.
rightOuterJoin() is almost identical to leftOuterJoin() except the key must be
present in the other RDD and the tuple has an option for the source rather than the
other RDD.
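As a rough illustration of that mirror-image behavior, here is a plain-Python sketch that does not use Spark. The helper right_outer_join and the sample data are hypothetical, and the output order simply follows the input lists, which a real RDD does not guarantee.

```python
from collections import defaultdict

def right_outer_join(source, other):
    """Hypothetical helper: mimic rightOuterJoin() on lists of (key, value) pairs."""
    # Group the values of the source collection by key.
    source_by_key = defaultdict(list)
    for key, value in source:
        source_by_key[key].append(value)

    result = []
    for key, value in other:
        matches = source_by_key.get(key)
        if matches:
            # One output entry per matching source value.
            result.extend((key, (match, value)) for match in matches)
        else:
            # Key missing in the source: the source slot is None (Python style).
            result.append((key, (None, value)))
    return result

# Made-up data: carol has a recommendation but no customer record,
# and bob has a customer record but no recommendation.
customers = [("alice", "Alice A."), ("bob", "Bob B.")]
recommendations = [("alice", "book"), ("carol", "lamp")]

joined = right_outer_join(customers, recommendations)
for entry in joined:
    print(entry)
```

Here bob is dropped because his key is absent from the other collection, while carol is kept with None in the source position.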
We can revisit Example 4-17 and do a leftOuterJoin() and a rightOuterJoin()
between the two pair RDDs we used to illustrate join() in Example 4-18.
Example 4-18. leftOuterJoin() and rightOuterJoin()
storeAddress.leftOuterJoin(storeRating) ==
{(Store("Ritual"),("1026 Valencia St",Some(4.9))),
  (Store("Starbucks"),("Seattle",None)),
  (Store("Philz"),("748 Van Ness Ave",Some(4.8))),
  (Store("Philz"),("3101 24th St",Some(4.8)))}

storeAddress.rightOuterJoin(storeRating) ==
{(Store("Ritual"),(Some("1026 Valencia St"),4.9)),
  (Store("Philz"),(Some("748 Van Ness Ave"),4.8)),
  (Store("Philz"),(Some("3101 24th St"),4.8))}
Sorting Data
Having sorted data is quite useful in many cases, especially when you're producing
downstream output. We can sort an RDD with key/value pairs provided that there is
an ordering defined on the key. Once we have sorted our data, any subsequent call on
the sorted data to collect() or save() will result in ordered data.
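The idea of ordering key/value pairs by key can be sketched in plain Python. This is an illustration only: sort_by_key is a hypothetical helper on an ordinary list and the data is made up. In PySpark the corresponding RDD method is sortByKey(), which this sketch merely imitates on a single machine.

```python
def sort_by_key(pairs, ascending=True):
    """Hypothetical helper: order (key, value) pairs by key only.

    Requires that the keys can be compared with each other, which is the
    "ordering defined on the key" the text refers to.
    """
    return sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)

# Made-up data, deliberately out of order.
pairs = [("b", 2), ("c", 3), ("a", 1)]

ascending_pairs = sort_by_key(pairs)
descending_pairs = sort_by_key(pairs, ascending=False)

print(ascending_pairs)
print(descending_pairs)
```

Once the pairs are sorted, anything that reads them in sequence sees them in key order, which is the property the text describes for collected or saved output.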