Option(s): spark.serializer
Default: org.apache.spark.serializer.JavaSerializer
Explanation: Class to use for serializing objects that will be sent over the network or need to be cached in serialized form. The default of Java serialization works with any serializable Java object but is quite slow, so we recommend using org.apache.spark.serializer.KryoSerializer and configuring Kryo serialization when speed is necessary. Can be any subclass of org.apache.spark.serializer.Serializer.

Option(s): spark.[X].port
Default: (random)
Explanation: Allows setting integer port values to be used by a running Spark application. This is useful in clusters where network access is secured. The possible values of X are driver, fileserver, broadcast, replClassServer, blockManager, and executor.

Option(s): spark.eventLog.enabled
Default: false
Explanation: Set to true to enable event logging, which allows completed Spark jobs to be viewed using a history server. For more information about Spark's history server, see the official documentation.

Option(s): spark.eventLog.dir
Default: file:///tmp/spark-events
Explanation: The storage location used for event logging, if enabled. This needs to be in a globally visible filesystem such as HDFS.
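To make these options concrete, the following is a minimal sketch of applying them through SparkConf; the application name and the HDFS URL are placeholders, and the event log directory must point at a filesystem visible to every node in your cluster.

import org.apache.spark.{SparkConf, SparkContext}

// Switch to Kryo serialization and enable event logging.
// "TunedApp" and the HDFS URL below are hypothetical values.
val conf = new SparkConf()
  .setAppName("TunedApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://namenode/shared/spark-events")

val sc = new SparkContext(conf)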
Almost all Spark configurations occur through the SparkConf construct, but one important option doesn't. To set the local storage directories for Spark to use for shuffle data (necessary for standalone and Mesos modes), you export the SPARK_LOCAL_DIRS environment variable inside conf/spark-env.sh to a comma-separated list of storage locations. SPARK_LOCAL_DIRS is described in detail in “Hardware Provisioning” on page 158. This is specified differently from other Spark configurations because its value may be different on different physical hosts.
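For example, conf/spark-env.sh on one host might contain a line like the following; the directory paths are hypothetical and will typically differ from host to host.

# conf/spark-env.sh (hypothetical paths; list the local disks on this host)
export SPARK_LOCAL_DIRS=/mnt/disk1/spark,/mnt/disk2/spark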
Components of Execution: Jobs, Tasks, and Stages
A first step in tuning and debugging Spark is to have a deeper understanding of the system's internal design. In previous chapters you saw the “logical” representation of RDDs and their partitions. When executing, Spark translates this logical representation into a physical execution plan by merging multiple operations into tasks. Understanding every aspect of Spark's execution is beyond the scope of this book, but an appreciation for the steps involved, along with the relevant terminology, can be helpful when tuning and debugging jobs.
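One concrete way to see the logical representation Spark starts from is the toDebugString method, which prints an RDD's lineage; shuffle dependencies appear as new indentation levels in its output, and these correspond to stage boundaries in the physical plan. A minimal sketch follows; the input path is a placeholder.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair RDD operations on older Spark versions

val sc = new SparkContext(new SparkConf().setAppName("LineageDemo"))

// Two narrow transformations (filter, map) pipeline into one stage;
// reduceByKey introduces a shuffle and hence a new stage.
// "input.txt" is a placeholder path.
val counts = sc.textFile("input.txt")
  .filter(line => line.nonEmpty)
  .map(line => (line.split(" ")(0), 1))
  .reduceByKey((x, y) => x + y)

// Print the RDD's lineage; shuffle boundaries show up as
// indentation changes in the output.
println(counts.toDebugString)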
 