optimizations, see “Working on a Per-Partition Basis” on page 107). Some tasks may
spend almost all of their time reading data from an external storage system, and will
not benefit much from additional optimization in Spark since they are bottlenecked
on input read.
Storage: Information for RDDs that are persisted
The storage page contains information about persisted RDDs. An RDD is persisted if
someone called persist() on the RDD and it was later computed in some job. In
some cases, if many RDDs are cached, older ones will fall out of memory to make
space for newer ones. This page will tell you exactly what fraction of each RDD is
cached and the quantity of data cached in various storage media (disk, memory, etc.).
It can be helpful to scan this page and understand whether important datasets are fitting into memory or not.
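For example, here is a minimal sketch of how an RDD ends up on this page (the input path and filter are illustrative, and sc is an existing SparkContext):

import org.apache.spark.storage.StorageLevel

// Illustrative example: mark an RDD for persistence and force a computation.
val lines = sc.textFile("hdfs:///logs/app.log")   // path is hypothetical
val errors = lines.filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)      // spill to disk if memory is tight

// The RDD appears on the storage page only after an action computes it.
println(errors.count())

Until the count() action runs, the RDD is merely marked for persistence and will not yet be listed on the storage page.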
Executors: A list of executors present in the application
This page lists the active executors in the application along with some metrics around
the processing and storage on each executor. One valuable use of this page is to confirm that your application has the amount of resources you were expecting. A good first step when debugging issues is to scan this page, since a misconfiguration resulting in fewer executors than expected can, for obvious reasons, affect performance. It
can also be useful to look for executors with anomalous behaviors, such as a very
large ratio of failed to successful tasks. An executor with a high failure rate could
indicate a misconfiguration or failure on the physical host in question. Simply
removing that host from the cluster can improve performance.
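As an illustration, here is a hedged sketch of verifying executor resources programmatically at startup (the configuration values are assumptions, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.instances applies on YARN; other cluster managers
// size executors differently.
val conf = new SparkConf()
  .setAppName("ExecutorSanityCheck")
  .set("spark.executor.instances", "4")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)

// getExecutorMemoryStatus maps each executor's block-manager address to
// (maximum memory, remaining memory); its size roughly tracks the executor
// count shown on the executors page (the driver contributes one entry too).
println(s"Executors seen by the driver: ${sc.getExecutorMemoryStatus.size}")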
Another feature of the executors page is the ability to collect a stack trace from executors using the Thread Dump button (this feature was introduced in Spark 1.2). Visualizing the thread call stack of an executor can show exactly what code is executing at
an instant in time. If an executor is sampled several times in a short time period with this feature, you can identify “hot spots,” or expensive sections, in user code. This type of informal profiling can often detect such inefficiencies.
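Conceptually, the Thread Dump button corresponds to a JVM-level thread snapshot; a minimal sketch of the same idea using the standard JVM API (illustrative only, not Spark's actual implementation):

import scala.collection.JavaConverters._

// Snapshot every live thread's call stack in the current JVM, similar in
// spirit to what the Thread Dump button reports for a single executor.
val dump = Thread.getAllStackTraces.asScala
for ((thread, frames) <- dump) {
  println(s"--- ${thread.getName} (${thread.getState}) ---")
  frames.take(5).foreach(frame => println(s"    at $frame"))
}

Repeating such snapshots and looking for frames that recur across samples is the essence of the “hot spot” identification described above.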
Environment: Debugging Spark's configuration
This page enumerates the set of active properties in the environment of your Spark
application. The configuration here represents the “ground truth” of your application's configuration. It can be helpful if you are debugging which configuration flags
are enabled, especially if you are using multiple configuration mechanisms. This page
will also enumerate JARs and files you've added to your application, which can be
useful when you're tracking down issues such as missing dependencies.
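To cross-check this page from code, here is a small sketch (assuming an existing SparkContext named sc) that prints every property the driver actually resolved:

// Dump the effective configuration, mirroring the environment page's
// "ground truth" view of which settings won out.
sc.getConf.getAll.sorted.foreach { case (key, value) =>
  println(s"$key = $value")
}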