Example 8-12. Using the Kryo serializer and registering classes
val conf = new SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Be strict about class registration
conf.set("spark.kryo.registrationRequired", "true")
conf.registerKryoClasses(Array(classOf[MyClass], classOf[MyOtherClass]))
Whether using Kryo or Java's serializer, you may encounter a NotSerializableException if your code refers to a class that does not extend Java's Serializable interface. It can be difficult to track down which class is causing the problem in this case, since many different classes can be referenced from user code. Many JVMs support a special option to help debug this situation: "-Dsun.io.serialization.extendedDebugInfo=true". You can enable this option using the --driver-java-options and --executor-java-options flags to spark-submit. Once you've found the class in question, the easiest solution is to simply modify it to implement Serializable. If you cannot modify the class in question, you'll need to use more advanced workarounds, such as creating a subclass of the type in question that implements Java's Externalizable interface or customizing the serialization behavior using Kryo.
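As a sketch of the Kryo workaround, suppose a hypothetical third-party class LegacyPoint cannot be modified to implement Serializable. A custom Kryo Serializer can be registered for it through a KryoRegistrator (the class and field names here are illustrative, not from the text):

```scala
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical class we cannot modify to implement Serializable.
class LegacyPoint(val x: Double, val y: Double)

// Custom Kryo serializer that writes only the fields we need.
class LegacyPointSerializer extends Serializer[LegacyPoint] {
  override def write(kryo: Kryo, output: Output, p: LegacyPoint): Unit = {
    output.writeDouble(p.x)
    output.writeDouble(p.y)
  }
  override def read(kryo: Kryo, input: Input, t: Class[LegacyPoint]): LegacyPoint =
    new LegacyPoint(input.readDouble(), input.readDouble())
}

// Registrator wired in via the spark.kryo.registrator property.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[LegacyPoint], new LegacyPointSerializer)
  }
}
```

To use it, point Spark at the registrator when building the configuration: conf.set("spark.kryo.registrator", classOf[MyRegistrator].getName).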
Memory Management
Spark uses memory in different ways, so understanding and tuning Spark's use of
memory can help optimize your application. Inside of each executor, memory is used
for a few purposes:
RDD storage
When you call persist() or cache() on an RDD, its partitions will be stored in memory buffers. Spark will limit the amount of memory used when caching to a certain fraction of the JVM's overall heap, set by spark.storage.memoryFraction. If this limit is exceeded, older partitions will be dropped from memory.
Shuffle and aggregation buffers
When performing shuffle operations, Spark will create intermediate buffers for
storing shuffle output data. These buffers are used to store intermediate results of
aggregations in addition to buffering data that is going to be directly output as
part of the shuffle. Spark will attempt to limit the total amount of memory used
in shuffle-related buffers to spark.shuffle.memoryFraction .
User code
Spark executes arbitrary user code, so user functions can themselves require substantial memory. For instance, if a user application allocates large arrays or other objects, these will contend for overall memory usage. User code has access to whatever is left in the JVM heap after the space for RDD storage and shuffle buffers is accounted for.
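The two memory fractions described above are ordinary configuration properties, so tuning them is just a matter of setting them on the SparkConf before the context is created. A minimal sketch (the 0.5 and 0.3 values are illustrative choices, not recommendations from the text):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Let cached RDD partitions use up to 50% of the executor heap.
conf.set("spark.storage.memoryFraction", "0.5")
// Cap shuffle and aggregation buffers at 30% of the executor heap.
conf.set("spark.shuffle.memoryFraction", "0.3")
// Whatever remains (here roughly 20%) is left for user code.
```

Lowering one fraction to raise another is a trade-off: more storage memory means fewer cached partitions are evicted, while more shuffle memory reduces spilling during aggregations.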