Example 8-12. Using the Kryo serializer and registering classes
val conf = new SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Be strict about class registration
conf.set("spark.kryo.registrationRequired", "true")
conf.registerKryoClasses(Array(classOf[MyClass], classOf[MyOtherClass]))
Whether using Kryo or Java's serializer, you may encounter a NotSerializableException if your code refers to a class that does not extend Java's Serializable interface. It can be difficult to track down which class is causing the problem in this case, since many different classes can be referenced from user code. Many JVMs support a special option to help debug this situation: "-Dsun.io.serialization.extendedDebugInfo=true". You can enable this option using the --driver-java-options and --executor-java-options flags to spark-submit. Once you've found the class in question, the easiest solution is to simply modify it to implement Serializable. If you cannot modify the class in question, you'll need to use more advanced workarounds, such as creating a subclass of the type in question that implements Java's Externalizable interface or customizing the serialization behavior using Kryo.
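As a sketch of the Kryo workaround, suppose a hypothetical third-party class LegacyPoint cannot be modified to implement Serializable. A custom Kryo Serializer can be registered for it through a KryoRegistrator (the class and field names here are illustrative, not from the text):

```scala
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical class we cannot modify to implement Serializable.
class LegacyPoint(val x: Double, val y: Double)

// Custom Kryo serializer that writes only the fields we need.
class LegacyPointSerializer extends Serializer[LegacyPoint] {
  override def write(kryo: Kryo, output: Output, p: LegacyPoint): Unit = {
    output.writeDouble(p.x)
    output.writeDouble(p.y)
  }
  override def read(kryo: Kryo, input: Input, t: Class[LegacyPoint]): LegacyPoint =
    new LegacyPoint(input.readDouble(), input.readDouble())
}

// Registrator wired in via the spark.kryo.registrator property.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[LegacyPoint], new LegacyPointSerializer)
  }
}
```

To use it, point Spark at the registrator when building the configuration: conf.set("spark.kryo.registrator", classOf[MyRegistrator].getName).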
Memory Management
Spark uses memory in different ways, so understanding and tuning Spark's use of
memory can help optimize your application. Inside of each executor, memory is used
for a few purposes:
RDD storage
When you call persist() or cache() on an RDD, its partitions will be stored in memory buffers. Spark will limit the amount of memory used when caching to a certain fraction of the JVM's overall heap, set by spark.storage.memoryFraction. If this limit is exceeded, older partitions will be dropped from memory.
Shuffle and aggregation buffers
When performing shuffle operations, Spark will create intermediate buffers for
storing shuffle output data. These buffers are used to store intermediate results of
aggregations in addition to buffering data that is going to be directly output as
part of the shuffle. Spark will attempt to limit the total amount of memory used
in shuffle-related buffers to spark.shuffle.memoryFraction .
User code
Spark executes arbitrary user code, so user functions can themselves require substantial memory. For instance, if a user application allocates large arrays or other objects, these will contend for overall memory usage. User code has access to whatever is left in the JVM heap after the space for RDD storage and shuffle buffers is accounted for.
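The two memory fractions described above are ordinary configuration properties, so tuning them is just a matter of setting them on the SparkConf before the context is created. A minimal sketch (the 0.5 and 0.3 values are illustrative choices, not recommendations from the text):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Let cached RDD partitions use up to 50% of the executor heap.
conf.set("spark.storage.memoryFraction", "0.5")
// Cap shuffle and aggregation buffers at 30% of the executor heap.
conf.set("spark.shuffle.memoryFraction", "0.3")
// Whatever remains (here roughly 20%) is left for user code.
```

Lowering one fraction to raise another is a trade-off: more storage memory means fewer cached partitions are evicted, while more shuffle memory reduces spilling during aggregations.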