Creating a custom Partitioner in Java is very similar to Scala: just extend the
spark.Partitioner class and implement the required methods.
In Python, you do not extend a Partitioner class, but instead pass a hash function as
an additional argument to RDD.partitionBy(). Example 4-27 demonstrates.
Example 4-27. Python custom partitioner
import urlparse  # Python 2; in Python 3, use: from urllib import parse as urlparse

def hash_domain(url):
    return hash(urlparse.urlparse(url).netloc)

rdd.partitionBy(20, hash_domain)  # Create 20 partitions
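You can sanity-check what such a hash function does outside Spark. The sketch below is a rough approximation, not Spark's actual internals: it assumes Python 3 (where urlparse lives in urllib.parse) and models partition assignment as the hash function applied to the key, modulo the partition count. The point it illustrates is that all URLs sharing a domain land in the same partition:

```python
from urllib.parse import urlparse

def hash_domain(url):
    # Hash only the domain (netloc), so URLs from one site collide
    return hash(urlparse(url).netloc)

def partition_for(url, num_partitions=20):
    # Rough model of partitionBy() with a custom hash function:
    # apply the function to the key, then reduce modulo the partition count.
    # Python's % always yields a non-negative result here, even for
    # negative hashes, so the index is a valid partition number.
    return hash_domain(url) % num_partitions

# Same domain, different paths -> same partition
p1 = partition_for("http://example.com/index.html")
p2 = partition_for("http://example.com/about.html")
print(p1 == p2)
```

Note that string hashes are randomized per interpreter process in Python 3, so the specific partition numbers will differ between runs, but URLs with the same domain always agree within a run.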
Note that the hash function you pass will be compared by identity to that of other
RDDs. If you want to partition multiple RDDs with the same partitioner, pass the
same function object (e.g., a global function) instead of creating a new lambda for
each one!
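The identity comparison is easy to see in plain Python. In this illustrative sketch (outside Spark), two lambdas with identical bodies are still distinct objects and so compare unequal, while a single named function is always equal to itself:

```python
def hash_domain(url):
    # Simplistic domain extraction, for illustration only
    return hash(url.split("/")[2])

# Two lambdas with identical bodies are still two different objects.
# Functions use default object equality (identity), so these are unequal,
# and Spark would treat them as different partitioners and reshuffle.
f1 = lambda url: hash_domain(url)
f2 = lambda url: hash_domain(url)
print(f1 == f2)  # False

# One shared named function is a single object, so it is equal to itself.
print(hash_domain == hash_domain)  # True
```

This is why a module-level function works across RDDs where a fresh lambda per call does not.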
Conclusion
In this chapter, we have seen how to work with key/value data using the specialized
functions available in Spark. The techniques from Chapter 3 also still work on our
pair RDDs. In the next chapter, we will look at how to load and save data.