Serialization
There are two aspects of serialization to consider in Spark: serialization of data and serialization of functions (or closures).
Data
Let's look at data serialization first. By default, Spark will use Java serialization to send data over the network from one executor to another, or when caching (persisting) data in serialized form as described in Persistence levels. Java serialization is well understood by programmers (you make sure the class you are using implements java.io.Serializable or java.io.Externalizable), but it is not particularly efficient from a performance or size perspective.
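To make the size cost concrete, here is a minimal sketch that round-trips one object through plain Java serialization and prints how many bytes it occupies. It uses only the JDK and does not involve Spark. The fields of WeatherRecord and the sample values are assumptions made for illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical record type; the fields are assumed for illustration.
// Scala case classes are Serializable without any extra declaration.
case class WeatherRecord(stationId: String, year: Int, temperature: Int)

object JavaSerializationSketch {
  // Write the object with the JDK's built-in serialization and return the bytes.
  def serialize(value: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    buffer.toByteArray
  }

  // Read an object back from bytes produced by serialize().
  def deserialize[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val record = WeatherRecord("011990-99999", 1950, 22)
    val bytes = serialize(record)
    // The stream carries class metadata as well as the field values,
    // which is why it is larger than the raw data.
    println(s"Java-serialized size: ${bytes.length} bytes")
    println(s"Round trip preserved the value: ${deserialize[WeatherRecord](bytes) == record}")
  }
}
```

The printed size is much larger than the handful of bytes of field data, because the stream also describes the class itself.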
A better choice for most Spark programs is Kryo serialization. Kryo is a more efficient general-purpose serialization library for Java. To use Kryo serialization, set the spark.serializer property as follows on the SparkConf in your driver program:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
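For context, the setting might sit in a driver program roughly as sketched here. This assumes spark-core is on the classpath. The object name KryoConfSketch, the helper kryoConf, and the application name are hypothetical.

```scala
import org.apache.spark.SparkConf

object KryoConfSketch {
  // Build a configuration that replaces the default Java serializer with Kryo.
  // The helper name and the appName parameter are illustrative only.
  def kryoConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  def main(args: Array[String]): Unit = {
    val conf = kryoConf("kryo-sketch")
    // Read the property back to confirm what the driver would use.
    println(conf.get("spark.serializer"))
  }
}
```

The property has to be set before the SparkContext is created from the SparkConf, since the configuration is read at that point.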
Kryo does not require that a class implement a particular interface (like java.io.Serializable) to be serialized, so plain old Java objects can be used in RDDs without any further work beyond enabling Kryo serialization. Having said that, it is much more efficient to register classes with Kryo before using them. This is because Kryo writes a reference to the class of the object being serialized (one reference is written for every object written), which is just an integer identifier if the class has been registered but is the full classname otherwise. This guidance only applies to your own classes; Spark registers Scala classes and many other framework classes (like Avro Generic or Thrift classes) on your behalf.
Registering classes with Kryo is straightforward. Create a subclass of KryoRegistrator, and override the registerClasses() method:
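import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class CustomKryoRegistrator extends KryoRegistrator {
  // Register each application class that will be serialized with Kryo.
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[WeatherRecord])
  }
}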
Finally, in your driver program, set the spark.kryo.registrator property to the fully qualified classname of your KryoRegistrator implementation:
conf.set("spark.kryo.registrator", "CustomKryoRegistrator")
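One way to check that registration takes effect is to serialize the same object with and without the registrator and compare the sizes. The sketch below does this by instantiating Spark's KryoSerializer directly, outside a running job. It assumes spark-core (which brings in Kryo) is on the classpath and that both classes are top-level classes in the default package. The fields of WeatherRecord and the helper kryoSize are hypothetical.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}

// Hypothetical record type; the fields are assumed for illustration.
case class WeatherRecord(stationId: String, year: Int, temperature: Int)

class CustomKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[WeatherRecord])
  }
}

object KryoRegistrationSketch {
  // Serialize one value with Spark's Kryo serializer and return the byte count.
  def kryoSize(conf: SparkConf, value: WeatherRecord): Int =
    new KryoSerializer(conf).newInstance().serialize(value).remaining()

  def main(args: Array[String]): Unit = {
    val record = WeatherRecord("011990-99999", 1950, 22)

    // Without a registrator, Kryo writes the full classname with the object.
    val unregistered = new SparkConf(false)

    // With the registrator, Kryo writes a compact integer identifier instead.
    val registered = new SparkConf(false)
      .set("spark.kryo.registrator", classOf[CustomKryoRegistrator].getName)

    println(s"unregistered: ${kryoSize(unregistered, record)} bytes")
    println(s"registered:   ${kryoSize(registered, record)} bytes")
  }
}
```

The saving per object is small, but it is paid once for every object written, so it adds up over large RDDs. If a separate registrator class feels heavyweight, SparkConf also provides registerKryoClasses(), which takes an array of classes to register.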