everything left in the JVM heap after the space for RDD storage and shuffle storage is allocated.
By default, Spark leaves 60% of the space for RDD storage, 20% for shuffle memory, and the remaining 20% for user programs. In some cases you can tune these fractions for better performance: if your user code allocates very large objects, it can make sense to shrink the storage and shuffle regions to avoid running out of memory.
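As a sketch of what such tuning looks like, the fractions described above correspond to the legacy (pre-1.6) Spark memory manager's spark.storage.memoryFraction and spark.shuffle.memoryFraction properties; the exact values below are illustrative, not recommendations:

```
# spark-defaults.conf -- legacy memory-manager settings; the defaults
# (0.6 storage, 0.2 shuffle) match the 60%/20% split described above.
# Shrinking them leaves more heap for user code that allocates large objects.
spark.storage.memoryFraction   0.5
spark.shuffle.memoryFraction   0.1
```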
In addition to tweaking memory regions, you can improve certain elements of Spark's default caching behavior for some workloads. Spark's default cache() operation persists data in memory using the MEMORY_ONLY storage level. This means that if there is not enough space to cache new RDD partitions, old ones are simply deleted and, if they are needed again, recomputed. It is sometimes better to call persist() with the MEMORY_AND_DISK storage level, which instead drops RDD partitions to disk and simply reads them back into memory from a local store when they are needed again. This can be much cheaper than recomputing blocks and can lead to more predictable performance. It is particularly useful if your RDD partitions are very expensive to recompute (for instance, if you are reading data from a database). The full list of possible storage levels is given in Table 3-6.
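A minimal sketch of the persist() call described above, assuming a local PySpark context; the RDD here is a stand-in for something genuinely expensive to rebuild, such as a database read:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[2]", "persist-demo")  # assumed local context

# Hypothetical expensive-to-recompute RDD (stands in for a database read)
expensive = sc.parallelize(range(100000)).map(lambda x: x * x)

# MEMORY_AND_DISK: partitions that do not fit in memory spill to local
# disk and are read back later, instead of being dropped and recomputed
expensive.persist(StorageLevel.MEMORY_AND_DISK)

expensive.count()  # first action materializes and caches the partitions
```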
A second improvement on the default caching policy is to cache serialized objects instead of raw Java objects, which you can accomplish using the MEMORY_ONLY_SER or MEMORY_AND_DISK_SER storage levels. Caching serialized objects slightly slows down the cache operation due to the cost of serializing objects, but it can substantially reduce time spent on garbage collection in the JVM, since many individual records can be stored as a single serialized buffer. This is because the cost of garbage collection scales with the number of objects on the heap, not the number of bytes of data, and this caching method takes many objects and serializes them into a single giant buffer. Consider this option if you are caching large amounts of data (e.g., gigabytes) as objects and/or seeing long garbage-collection pauses. Such pauses are visible in the application UI under the GC Time column for each task.
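The effect that serialization has on object count can be illustrated outside Spark with plain Python and pickle (a stand-in for Spark's Java/Kryo serializers): many small records collapse into one buffer, which is one object for the collector to track instead of one per record:

```python
import pickle

# 100,000 small records: roughly one heap object per record (plus the
# tuples' fields), each of which the garbage collector must track
records = [(i, "value-%d" % i) for i in range(100000)]

# Serialized, the same data becomes a single bytes buffer: one object
# on the heap regardless of how many records it contains
buf = pickle.dumps(records)

assert isinstance(buf, bytes)          # a single contiguous buffer
assert pickle.loads(buf) == records    # deserializes back to the same data
```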
Hardware Provisioning
The hardware resources you give to Spark will have a significant effect on the completion time of your application. The main parameters that affect cluster sizing are the amount of memory given to each executor, the number of cores for each executor, the total number of executors, and the number of local disks to use for scratch data.
In all deployment modes, executor memory is set with spark.executor.memory or the --executor-memory flag to spark-submit. The options for the number of executors and the cores per executor differ depending on deployment mode. In YARN you can set spark.execu
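An illustrative spark-submit invocation setting executor memory; the master URL and application file are placeholders, not values from the text:

```
# Hypothetical invocation: 4 GiB of heap per executor on a YARN cluster
spark-submit \
  --master yarn \
  --executor-memory 4g \
  my_app.py
```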