Database Reference
In-Depth Information
One issue to watch out for when passing functions is inadvertently serializing the
object containing the function. When you pass a function that is the member of an
object, or contains references to fields in an object (e.g.,
self.field
), Spark sends the
entire
object to worker nodes, which can be much larger than the bit of information
you need (see
Example 3-19
). Sometimes this can also cause your program to fail, if
your class contains objects that Python can't figure out how to pickle.
Example 3-19. Passing a function with field references (don't do this!)
class
SearchFunctions
(
object
):
def
__init__
(
self
,
query
):
self
.
query
=
query
def
isMatch
(
self
,
s
):
return
self
.
query
in
s
def
getMatchesFunctionReference
(
self
,
rdd
):
# Problem: references all of "self" in "self.isMatch"
return
rdd
.
filter
(
self
.
isMatch
)
def
getMatchesMemberReference
(
self
,
rdd
):
# Problem: references all of "self" in "self.query"
return
rdd
.
filter
(
lambda
x
:
self
.
query
in
x
)
Instead, just extract the fields you need from your object into a local variable and pass
that in, like we do in
Example 3-20
.
Example 3-20. Python function passing without field references
class
WordFunctions
(
object
):
...
def
getMatchesNoReference
(
self
,
rdd
):
# Safe: extract only the field we need into a local variable
query
=
self
.
query
return
rdd
.
filter
(
lambda
x
:
query
in
x
)
Scala
In Scala, we can pass in functions defined inline, references to methods, or static
functions as we do for Scala's other functional APIs. Some other considerations come
into play, though—namely that the function we pass and the data referenced in it
needs to be serializable (implementing Java's Serializable interface). Furthermore, as
in Python, passing a method or field of an object includes a reference to that whole
object, though this is less obvious because we are not forced to write these references
with
self
. As we did with Python in
Example 3-20
, we can instead extract the fields
we need as local variables and avoid needing to pass the whole object containing
them, as shown in
Example 3-21
.