spark.executor.cores or the --executor-cores flag and the --num-executors flag to determine the total count. In Mesos and Standalone mode, Spark will greedily acquire as many cores and executors as are offered by the scheduler. However, both Mesos and Standalone mode support setting spark.cores.max to limit the total number of cores across all executors for an application. Local disks are used for scratch storage during shuffle operations.
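As one illustration, the same sizing properties can be supplied programmatically through SparkConf rather than on the spark-submit command line; the values below are placeholders rather than recommendations, and the executor-count property assumes a YARN deployment:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SizingExample")
  // YARN mode: equivalent to the --executor-cores and --num-executors flags
  .set("spark.executor.cores", "4")
  .set("spark.executor.instances", "10")
  // Mesos and Standalone mode: cap the total cores used across all executors
  .set("spark.cores.max", "40")
val sc = new SparkContext(conf)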
Broadly speaking, Spark applications will benefit from having more memory and cores. Spark's architecture allows for linear scaling; adding twice the resources will often make your application run twice as fast. An additional consideration when sizing a Spark application is whether you plan to cache intermediate datasets as part of your workload. If you do plan to use caching, the more of your cached data that can fit in memory, the better the performance will be. The Spark storage UI will give details about what fraction of your cached data is in memory. One approach is to start by caching a subset of your data on a smaller cluster and extrapolating the total memory you will need to fit larger amounts of the data in memory.
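A minimal sketch of that extrapolation, assuming an existing RDD named fullData and an illustrative 10% sample fraction, might look like this:

import org.apache.spark.storage.StorageLevel

// fullData is a placeholder for whatever RDD your application caches.
// Cache about a tenth of it, then check the Storage tab of the Spark UI
// to see how much memory that fraction actually occupies.
val sample = fullData.sample(withReplacement = false, fraction = 0.1)
sample.persist(StorageLevel.MEMORY_ONLY)
sample.count()  // force materialization so the cached size appears in the UI

// If the sample occupies S gigabytes in the UI, the full dataset will need
// roughly 10 * S gigabytes of aggregate storage memory to fit entirely.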
In addition to memory and cores, Spark uses local disk volumes to store intermediate data required during shuffle operations along with RDD partitions that are spilled to disk. Using a larger number of local disks can help accelerate the performance of Spark applications. In YARN mode, the configuration for local disks is read directly from YARN, which provides its own mechanism for specifying scratch storage directories. In Standalone mode, you can set the SPARK_LOCAL_DIRS environment variable in spark-env.sh when deploying the Standalone cluster, and Spark applications will inherit this config when they are launched. In Mesos mode, or if you are running in another mode and want to override the cluster's default storage locations, you can set the spark.local.dir option. In all cases you specify the local directories using a single comma-separated list. It is common to have one local directory for each disk volume available to Spark. Writes will be evenly striped across all local directories provided. Larger numbers of disks will provide higher overall throughput.
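As an example, on a host with two scratch disks the two mechanisms might be used as follows; the directory paths are hypothetical placeholders:

import org.apache.spark.SparkConf

// Standalone mode: in conf/spark-env.sh on each worker, before launching
// the cluster, e.g.:
//   export SPARK_LOCAL_DIRS="/mnt/disk1/spark,/mnt/disk2/spark"
//
// Mesos mode (or to override the cluster default from the application):
val conf = new SparkConf()
  .setAppName("LocalDirsExample")
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")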
One caveat to the “more is better” guideline is when sizing memory for executors. Using very large heap sizes can cause garbage collection pauses that hurt the throughput of a Spark job. It can sometimes be beneficial to request smaller executors (say, 64 GB or less) to mitigate this issue. Mesos and YARN can, out of the box, support packing multiple, smaller executors onto the same physical host, so requesting smaller executors doesn't mean your application will have fewer overall resources. In Spark's Standalone mode, you need to launch multiple workers (determined using SPARK_WORKER_INSTANCES) for a single application to run more than one executor on a host. This limitation will likely be removed in a later version of Spark. In addition to using smaller executors, storing data in serialized form (see “Memory Management” on page 157) can also help alleviate garbage collection.
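One possible expression of the serialized-form suggestion, assuming a placeholder RDD named cachedData and optionally switching to Kryo serialization:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// Optional: Kryo generally produces smaller, faster serialized data than
// Java serialization, which compounds the GC benefit of serialized caching.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// cachedData is a placeholder for whatever RDD your application caches.
// MEMORY_ONLY_SER stores each cached partition as one large byte buffer
// instead of many small Java objects, giving the garbage collector far
// fewer objects to track.
cachedData.persist(StorageLevel.MEMORY_ONLY_SER)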