spark.executor.cores or the --executor-cores flag and the --num-executors flag to determine the total count. In Mesos and Standalone mode, Spark will greedily acquire as many cores and executors as are offered by the scheduler. However, both Mesos and Standalone mode support setting spark.cores.max to limit the total number of cores across all executors for an application. Local disks are used for scratch storage during shuffle operations.
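As one illustration, the same sizing properties can be supplied programmatically through SparkConf rather than on the spark-submit command line; the values below are placeholders rather than recommendations, and the executor-count property assumes a YARN deployment:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SizingExample")
  // YARN mode: equivalent to the --executor-cores and --num-executors flags
  .set("spark.executor.cores", "4")
  .set("spark.executor.instances", "10")
  // Mesos and Standalone mode: cap the total cores used across all executors
  .set("spark.cores.max", "40")
val sc = new SparkContext(conf)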
Broadly speaking, Spark applications will benefit from having more memory and cores. Spark's architecture allows for linear scaling; adding twice the resources will often make your application run twice as fast. An additional consideration when sizing a Spark application is whether you plan to cache intermediate datasets as part of your workload. If you do plan to use caching, the more of your cached data that can fit in memory, the better the performance will be. The Spark storage UI will give details about what fraction of your cached data is in memory. One approach is to start by caching a subset of your data on a smaller cluster and extrapolating the total memory you will need to fit larger amounts of the data in memory.
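A minimal sketch of that extrapolation, assuming an existing RDD named fullData and an illustrative 10% sample fraction, might look like this:

import org.apache.spark.storage.StorageLevel

// fullData is a placeholder for whatever RDD your application caches.
// Cache about a tenth of it, then check the Storage tab of the Spark UI
// to see how much memory that fraction actually occupies.
val sample = fullData.sample(withReplacement = false, fraction = 0.1)
sample.persist(StorageLevel.MEMORY_ONLY)
sample.count()  // force materialization so the cached size appears in the UI

// If the sample occupies S gigabytes in the UI, the full dataset will need
// roughly 10 * S gigabytes of aggregate storage memory to fit entirely.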
In addition to memory and cores, Spark uses local disk volumes to store intermediate data required during shuffle operations along with RDD partitions that are spilled to disk. Using a larger number of local disks can help accelerate the performance of Spark applications. In YARN mode, the configuration for local disks is read directly from YARN, which provides its own mechanism for specifying scratch storage directories. In Standalone mode, you can set the SPARK_LOCAL_DIRS environment variable in spark-env.sh when deploying the Standalone cluster, and Spark applications will inherit this config when they are launched. In Mesos mode, or if you are running in another mode and want to override the cluster's default storage locations, you can set the spark.local.dir option. In all cases you specify the local directories using a single comma-separated list. It is common to have one local directory for each disk volume available to Spark. Writes will be evenly striped across all local directories provided. Larger numbers of disks will provide higher overall throughput.
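As an example, on a host with two scratch disks the two mechanisms might be used as follows; the directory paths are hypothetical placeholders:

import org.apache.spark.SparkConf

// Standalone mode: in conf/spark-env.sh on each worker, before launching
// the cluster, e.g.:
//   export SPARK_LOCAL_DIRS="/mnt/disk1/spark,/mnt/disk2/spark"
//
// Mesos mode (or to override the cluster default from the application):
val conf = new SparkConf()
  .setAppName("LocalDirsExample")
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")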
One caveat to the “more is better” guideline is when sizing memory for executors. Using very large heap sizes can cause garbage collection pauses that hurt the throughput of a Spark job. It can sometimes be beneficial to request smaller executors (say, 64 GB or less) to mitigate this issue. Mesos and YARN can, out of the box, support packing multiple, smaller executors onto the same physical host, so requesting smaller executors doesn't mean your application will have fewer overall resources. In Spark's Standalone mode, you need to launch multiple workers (determined using SPARK_WORKER_INSTANCES) for a single application to run more than one executor on a host. This limitation will likely be removed in a later version of Spark. In addition to using smaller executors, storing data in serialized form (see “Memory Management” on page 157) can also help alleviate garbage collection.
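One possible expression of the serialized-form suggestion, assuming a placeholder RDD named cachedData and optionally switching to Kryo serialization:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// Optional: Kryo generally produces smaller, faster serialized data than
// Java serialization, which compounds the GC benefit of serialized caching.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// cachedData is a placeholder for whatever RDD your application caches.
// MEMORY_ONLY_SER stores each cached partition as one large byte buffer
// instead of many small Java objects, giving the garbage collector far
// fewer objects to track.
cachedData.persist(StorageLevel.MEMORY_ONLY_SER)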