everything left in the JVM heap after the space for RDD storage and shuffle storage is allocated.
By default, Spark leaves 60% of the space for RDD storage, 20% for shuffle memory, and the remaining 20% for user programs. In some cases you can tune these fractions for better performance: if your user code allocates very large objects, it can make sense to shrink the storage and shuffle regions to avoid running out of memory.
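As a sketch of what such tuning looks like, the fractions described above correspond to the legacy (pre-1.6) Spark memory manager's spark.storage.memoryFraction and spark.shuffle.memoryFraction properties; the exact values below are illustrative, not recommendations:

```
# spark-defaults.conf -- legacy memory-manager settings; the defaults
# (0.6 storage, 0.2 shuffle) match the 60%/20% split described above.
# Shrinking them leaves more heap for user code that allocates large objects.
spark.storage.memoryFraction   0.5
spark.shuffle.memoryFraction   0.1
```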
In addition to tweaking memory regions, you can improve certain elements of Spark's default caching behavior for some workloads. Spark's default cache() operation persists data in memory using the MEMORY_ONLY storage level. This means that if there is not enough space to cache new RDD partitions, old ones are simply deleted and, if they are needed again, recomputed. It is sometimes better to call persist() with the MEMORY_AND_DISK storage level, which instead drops RDD partitions to disk and simply reads them back into memory from a local store when they are needed again. This can be much cheaper than recomputing blocks and can lead to more predictable performance. It is particularly useful if your RDD partitions are very expensive to recompute (for instance, if you are reading data from a database). The full list of possible storage levels is given in Table 3-6.
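A minimal sketch of the persist() call described above, assuming a local PySpark context; the RDD here is a stand-in for something genuinely expensive to rebuild, such as a database read:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[2]", "persist-demo")  # assumed local context

# Hypothetical expensive-to-recompute RDD (stands in for a database read)
expensive = sc.parallelize(range(100000)).map(lambda x: x * x)

# MEMORY_AND_DISK: partitions that do not fit in memory spill to local
# disk and are read back later, instead of being dropped and recomputed
expensive.persist(StorageLevel.MEMORY_AND_DISK)

expensive.count()  # first action materializes and caches the partitions
```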
A second improvement on the default caching policy is to cache serialized objects instead of raw Java objects, which you can accomplish using the MEMORY_ONLY_SER or MEMORY_AND_DISK_SER storage levels. Caching serialized objects slightly slows down the cache operation due to the cost of serializing objects, but it can substantially reduce time spent on garbage collection in the JVM, since many individual records can be stored as a single serialized buffer. This is because the cost of garbage collection scales with the number of objects on the heap, not the number of bytes of data, and this caching method takes many objects and serializes them into a single giant buffer. Consider this option if you are caching large amounts of data (e.g., gigabytes) as objects and/or seeing long garbage-collection pauses. Such pauses are visible in the application UI under the GC Time column for each task.
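The effect that serialization has on object count can be illustrated outside Spark with plain Python and pickle (a stand-in for Spark's Java/Kryo serializers): many small records collapse into one buffer, which is one object for the collector to track instead of one per record:

```python
import pickle

# 100,000 small records: roughly one heap object per record (plus the
# tuples' fields), each of which the garbage collector must track
records = [(i, "value-%d" % i) for i in range(100000)]

# Serialized, the same data becomes a single bytes buffer: one object
# on the heap regardless of how many records it contains
buf = pickle.dumps(records)

assert isinstance(buf, bytes)          # a single contiguous buffer
assert pickle.loads(buf) == records    # deserializes back to the same data
```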
Hardware Provisioning
The hardware resources you give to Spark will have a significant effect on the completion time of your application. The main parameters that affect cluster sizing are the amount of memory given to each executor, the number of cores for each executor, the total number of executors, and the number of local disks to use for scratch data.
In all deployment modes, executor memory is set with spark.executor.memory or the --executor-memory flag to spark-submit. The options for the number of executors and the cores per executor differ depending on deployment mode. In YARN you can set spark.execu
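An illustrative spark-submit invocation setting executor memory; the master URL and application file are placeholders, not values from the text:

```
# Hypothetical invocation: 4 GiB of heap per executor on a YARN cluster
spark-submit \
  --master yarn \
  --executor-memory 4g \
  my_app.py
```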