Creating a custom Partitioner in Java is very similar to Scala: just extend the spark.Partitioner class and implement the required methods. In Python, you do not extend a Partitioner class, but instead pass a hash function as an additional argument to RDD.partitionBy(). Example 4-27 demonstrates.
Example 4-27. Python custom partitioner
import urlparse

def hash_domain(url):
    return hash(urlparse.urlparse(url).netloc)

rdd.partitionBy(20, hash_domain)  # Create 20 partitions
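To see what partitionBy() is doing with that function, here is a minimal, Spark-free sketch. It assumes the standard hash-modulo assignment of keys to partitions (Spark's actual implementation also makes the result non-negative), and uses Python 3's urllib.parse in place of the Python 2 urlparse module shown above:

```python
# Sketch (no Spark required): model how a domain-based hash partitioner
# assigns URLs to partitions. partition_for() is a hypothetical helper,
# not a Spark API.
from urllib.parse import urlparse  # Python 3 equivalent of urlparse

def hash_domain(url):
    # Partition key is the network location (domain) of the URL.
    return hash(urlparse(url).netloc)

def partition_for(url, num_partitions=20):
    # Simplified model of Spark's assignment: hash modulo the number of
    # partitions. Python's % already yields a non-negative result here.
    return hash_domain(url) % num_partitions

# URLs on the same domain land in the same partition.
a = partition_for("http://example.com/a")
b = partition_for("http://example.com/b")
assert a == b
assert 0 <= a < 20
```

Because both URLs share the domain example.com, they hash to the same partition, which is exactly the locality property the custom partitioner is meant to provide.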
Note that the hash function you pass will be compared by identity to that of other RDDs. If you want to partition multiple RDDs with the same partitioner, pass the same function object (e.g., a global function) instead of creating a new lambda for each one!
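The identity pitfall is easy to reproduce in plain Python, without Spark. Two lambdas with identical behavior are still distinct objects, so (under the identity comparison described above) Spark would treat them as different partitioners, while a single named function always compares equal to itself:

```python
# Two behaviorally identical lambdas are distinct objects:
f = lambda url: hash(url)
g = lambda url: hash(url)
assert f is not g            # identity comparison fails

# A single global function is the same object everywhere it is used:
def hash_url(url):           # hypothetical partitioning function
    return hash(url)

assert hash_url is hash_url  # identity comparison succeeds
```

This is why passing the same global function to every partitionBy() call lets Spark recognize that the RDDs share a partitioner.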
Conclusion
In this chapter, we have seen how to work with key/value data using the specialized
functions available in Spark. The techniques from
Chapter 3
also still work on our
pair RDDs. In the next chapter, we will look at how to load and save data.