Sometimes we want a key in our result even when it is not present in both RDDs.
For example, if we were joining customer information with recommendations, we
might not want to drop customers who do not have any recommendations yet.
leftOuterJoin(other) and rightOuterJoin(other) both join pair RDDs together by
key, where one of the pair RDDs can be missing the key.
With leftOuterJoin() the resulting pair RDD has entries for each key in the source
RDD. The value associated with each key in the result is a tuple of the value from the
source RDD and an Option (or Optional in Java) for the value from the other pair
RDD. In Python, None is used if a value isn't present; if the value is present, the
regular value is used without any wrapper. As with join(), we can have multiple
entries for each key; when this occurs, we get the Cartesian product between the two
lists of values.
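The behavior described above can be modeled in plain Python without a cluster. This is only a sketch of the semantics, not Spark's implementation: `left_outer_join` and the customer and recommendation data are hypothetical names made up for illustration, and lists of (key, value) tuples stand in for pair RDDs.

```python
def left_outer_join(source, other):
    """Model leftOuterJoin() on lists of (key, value) pairs (illustrative only)."""
    # Group the other side's values by key so each lookup is cheap.
    other_values = {}
    for key, value in other:
        other_values.setdefault(key, []).append(value)

    result = []
    for key, value in source:
        # A key missing from `other` is paired with None, mirroring the
        # Python behavior described in the text.
        for match in other_values.get(key, [None]):
            result.append((key, (value, match)))
    return result


# Hypothetical data: "alice" has two recommendations, "bob" has none.
customers = [("alice", "Seattle"), ("bob", "Oakland")]
recommendations = [("alice", "espresso"), ("alice", "pour-over")]

joined = left_outer_join(customers, recommendations)
for row in joined:
    print(row)
```

Every key from the source list survives: "alice" appears once per matching recommendation, and "bob" is kept with None in place of the missing value.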
Optional is part of Google's Guava library and represents a possibly
missing value. We can check isPresent() to see if it's set, and
get() will return the contained instance provided data is present.
rightOuterJoin() is almost identical to leftOuterJoin() except the key must be
present in the other RDD and the tuple has an option for the source rather than the
other RDD.
In Example 4-18, we revisit Example 4-17 and apply leftOuterJoin() and rightOuterJoin()
to the two pair RDDs we used to illustrate join().
Example 4-18. leftOuterJoin() and rightOuterJoin()
storeAddress.leftOuterJoin(storeRating) ==
{(Store("Ritual"),("1026 Valencia St",Some(4.9))),
(Store("Starbucks"),("Seattle",None)),
(Store("Philz"),("748 Van Ness Ave",Some(4.8))),
(Store("Philz"),("3101 24th St",Some(4.8)))}
storeAddress.rightOuterJoin(storeRating) ==
{(Store("Ritual"),(Some("1026 Valencia St"),4.9)),
(Store("Philz"),(Some("748 Van Ness Ave"),4.8)),
(Store("Philz"),(Some("3101 24th St"),4.8))}
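To see why "Starbucks" drops out of the right outer join, the same result can be reproduced with a plain-Python model. This is a sketch under stated assumptions: `right_outer_join` is a hypothetical helper rather than Spark's API, store names are plain strings instead of Store objects, and the two input lists are inferred from the output shown in Example 4-18. As in Python Spark, present values carry no Some wrapper.

```python
def right_outer_join(source, other):
    """Model rightOuterJoin(): every key in `other` appears in the result."""
    # Group the source side's values by key.
    source_values = {}
    for key, value in source:
        source_values.setdefault(key, []).append(value)

    result = []
    for key, value in other:
        # Here the source side is the optional one; None marks a missing key.
        for match in source_values.get(key, [None]):
            result.append((key, (match, value)))
    return result


# Inputs inferred from the output of Example 4-18 (an assumption).
store_address = [
    ("Ritual", "1026 Valencia St"),
    ("Philz", "748 Van Ness Ave"),
    ("Philz", "3101 24th St"),
    ("Starbucks", "Seattle"),
]
store_rating = [("Ritual", 4.9), ("Philz", 4.8)]

right_joined = right_outer_join(store_address, store_rating)
for row in right_joined:
    print(row)
```

Only keys that have a rating are emitted, so "Starbucks" is absent, while "Philz" appears once per address.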
Sorting Data
Having sorted data is quite useful in many cases, especially when you're producing
downstream output. We can sort an RDD with key/value pairs provided that there is
an ordering defined on the key. Once we have sorted our data, any subsequent call on
the sorted data to collect() or save() will result in ordered data.
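The idea of sorting key/value pairs by an ordering on the key can be illustrated in plain Python. This is only a model of the concept (in Spark, pair RDDs are sorted with sortByKey()); the store names and ratings below are hypothetical.

```python
# Hypothetical (key, value) pairs standing in for a pair RDD.
pairs = [("philz", 4.8), ("ritual", 4.9), ("blue bottle", 4.7)]

# Sort on the key only; strings have a natural ordering, so no custom
# comparison function is needed here.
sorted_pairs = sorted(pairs, key=lambda kv: kv[0])

# Any later pass over the sorted list sees the keys in order.
keys_in_order = [key for key, _ in sorted_pairs]
print(keys_in_order)
```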