Working with Key/Value Pairs - Learning Spark


Sometimes we don't need the key to be present in both RDDs to want it in our result.

For example, if we were joining customer information with recommendations we

might not want to drop customers if there were not any recommendations yet.

leftOuterJoin(other) and rightOuterJoin(other) both join pair RDDs together by

key, where one of the pair RDDs can be missing the key.

With leftOuterJoin() the resulting pair RDD has entries for each key in the source

RDD. The value associated with each key in the result is a tuple of the value from the

source RDD and an Option (or Optional in Java) for the value from the other pair

RDD. In Python, if a value isn't present, None is used; if the value is present, the

regular value, without any wrapper, is used. As with join(), we can have multiple

entries for each key; when this occurs, we get the Cartesian product between the two

lists of values.
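
To make the semantics above concrete, here is a minimal plain-Python sketch that models a left outer join on ordinary lists of (key, value) pairs. It is not Spark code and not how Spark implements leftOuterJoin(); the function name left_outer_join is hypothetical, the input data is reconstructed from the output shown in Example 4-18, and plain strings stand in for the Store keys. Unlike an RDD, the result here is a list with a fixed order.

```python
from collections import defaultdict


def left_outer_join(source, other):
    """Model a left outer join over two lists of (key, value) pairs.

    Hypothetical helper for illustration only; it mirrors the Python
    behavior described in the text (None when the other side is missing).
    """
    # Group the other side's values by key so each key can have several.
    other_by_key = defaultdict(list)
    for key, value in other:
        other_by_key[key].append(value)

    result = []
    for key, value in source:
        # Every source entry survives; a key with no match pairs with None.
        for other_value in other_by_key.get(key, [None]):
            result.append((key, (value, other_value)))
    return result


# Data assumed from the output of Example 4-18.
store_address = [
    ("Ritual", "1026 Valencia St"),
    ("Philz", "748 Van Ness Ave"),
    ("Philz", "3101 24th St"),
    ("Starbucks", "Seattle"),
]
store_rating = [("Ritual", 4.9), ("Philz", 4.8)]

joined = left_outer_join(store_address, store_rating)
for pair in joined:
    print(pair)
```

When a key appears several times on both sides, the nested loops produce every combination of values for that key, which is the Cartesian product mentioned above.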

Optional is part of Google's Guava library and represents a possibly

missing value. We can check isPresent() to see if it's set, and

get() will return the contained instance provided data is present.

rightOuterJoin() is almost identical to leftOuterJoin() except the key must be

present in the other RDD and the tuple has an option for the source rather than the

other RDD.
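
The mirrored behavior can be sketched the same way. The following plain-Python model is an illustration under stated assumptions, not Spark's implementation: right_outer_join is a hypothetical helper, the store data is reconstructed from the output in Example 4-18, and the "Hypothetical Cafe" rating is invented purely to show a key that exists only on the other side.

```python
from collections import defaultdict


def right_outer_join(source, other):
    """Model a right outer join over two lists of (key, value) pairs.

    Hypothetical helper for illustration only; the source side becomes
    None when a key is present only in the other list.
    """
    # Group the source side's values by key so each key can have several.
    source_by_key = defaultdict(list)
    for key, value in source:
        source_by_key[key].append(value)

    result = []
    for key, other_value in other:
        # Every entry of the other side survives; a missing source is None.
        for source_value in source_by_key.get(key, [None]):
            result.append((key, (source_value, other_value)))
    return result


# Data assumed from the output of Example 4-18.
store_address = [
    ("Ritual", "1026 Valencia St"),
    ("Philz", "748 Van Ness Ave"),
    ("Philz", "3101 24th St"),
    ("Starbucks", "Seattle"),
]
store_rating = [("Ritual", 4.9), ("Philz", 4.8)]

joined = right_outer_join(store_address, store_rating)

# Invented rating with no matching address, to show the None case.
with_unmatched = right_outer_join(
    store_address, store_rating + [("Hypothetical Cafe", 4.0)]
)
```

Starbucks drops out of joined because it has no rating, while the invented key in with_unmatched is kept with None in the address position.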

In Example 4-18, we revisit Example 4-17 and do a leftOuterJoin() and a

rightOuterJoin() between the two pair RDDs we used to illustrate join().

Example 4-18. leftOuterJoin() and rightOuterJoin()

storeAddress.leftOuterJoin(storeRating) ==
{(Store("Ritual"),("1026 Valencia St",Some(4.9))),
 (Store("Starbucks"),("Seattle",None)),
 (Store("Philz"),("748 Van Ness Ave",Some(4.8))),
 (Store("Philz"),("3101 24th St",Some(4.8)))}

storeAddress.rightOuterJoin(storeRating) ==
{(Store("Ritual"),(Some("1026 Valencia St"),4.9)),
 (Store("Philz"),(Some("748 Van Ness Ave"),4.8)),
 (Store("Philz"),(Some("3101 24th St"),4.8))}

Sorting Data

Having sorted data is quite useful in many cases, especially when you're producing

downstream output. We can sort an RDD with key/value pairs provided that there is

an ordering defined on the key. Once we have sorted our data, any subsequent call on

the sorted data to collect() or save() will result in ordered data.
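
As a rough illustration of sorting by key, the sketch below orders a plain Python list of (key, value) pairs. It only models the ordering; it does not use Spark, where the sortByKey() method on a pair RDD plays this role. The sample pairs are assumed from the store data in Example 4-18, with strings standing in for the Store keys.

```python
# Sample (key, value) pairs, assumed from the data in Example 4-18.
store_address = [
    ("Ritual", "1026 Valencia St"),
    ("Philz", "748 Van Ness Ave"),
    ("Philz", "3101 24th St"),
    ("Starbucks", "Seattle"),
]

# Sort on the key only; strings have a natural ordering, so this works.
ascending = sorted(store_address, key=lambda pair: pair[0])

# The same ordering reversed, for descending keys.
descending = sorted(store_address, key=lambda pair: pair[0], reverse=True)

for key, value in ascending:
    print(key, value)
```

Because only the key takes part in the comparison, values never need an ordering of their own, and entries that share a key keep their original relative order in this model.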
