Creating a custom Partitioner in Java is very similar to Scala: just extend the spark.Partitioner class and implement the required methods. In Python, you do not extend a Partitioner class, but instead pass a hash function as an additional argument to RDD.partitionBy(). Example 4-27 demonstrates.
Example 4-27. Python custom partitioner
import urlparse

def hash_domain(url):
    return hash(urlparse.urlparse(url).netloc)

rdd.partitionBy(20, hash_domain)  # Create 20 partitions
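To see what partitionBy() is doing with that function, here is a minimal, Spark-free sketch. It assumes the standard hash-modulo assignment of keys to partitions (Spark's actual implementation also makes the result non-negative), and uses Python 3's urllib.parse in place of the Python 2 urlparse module shown above:

```python
# Sketch (no Spark required): model how a domain-based hash partitioner
# assigns URLs to partitions. partition_for() is a hypothetical helper,
# not a Spark API.
from urllib.parse import urlparse  # Python 3 equivalent of urlparse

def hash_domain(url):
    # Partition key is the network location (domain) of the URL.
    return hash(urlparse(url).netloc)

def partition_for(url, num_partitions=20):
    # Simplified model of Spark's assignment: hash modulo the number of
    # partitions. Python's % already yields a non-negative result here.
    return hash_domain(url) % num_partitions

# URLs on the same domain land in the same partition.
a = partition_for("http://example.com/a")
b = partition_for("http://example.com/b")
assert a == b
assert 0 <= a < 20
```

Because both URLs share the domain example.com, they hash to the same partition, which is exactly the locality property the custom partitioner is meant to provide.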
Note that the hash function you pass will be compared by identity to that of other RDDs. If you want to partition multiple RDDs with the same partitioner, pass the same function object (e.g., a global function) instead of creating a new lambda for each one!
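The identity pitfall is easy to reproduce in plain Python, without Spark. Two lambdas with identical behavior are still distinct objects, so (under the identity comparison described above) Spark would treat them as different partitioners, while a single named function always compares equal to itself:

```python
# Two behaviorally identical lambdas are distinct objects:
f = lambda url: hash(url)
g = lambda url: hash(url)
assert f is not g            # identity comparison fails

# A single global function is the same object everywhere it is used:
def hash_url(url):           # hypothetical partitioning function
    return hash(url)

assert hash_url is hash_url  # identity comparison succeeds
```

This is why passing the same global function to every partitionBy() call lets Spark recognize that the RDDs share a partitioner.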
Conclusion
In this chapter, we have seen how to work with key/value data using the specialized
functions available in Spark. The techniques from
Chapter 3
also still work on our
pair RDDs. In the next chapter, we will look at how to load and save data.