optimizations, see “Working on a Per-Partition Basis” on page 107). Some tasks may
spend almost all of their time reading data from an external storage system, and will
not benefit much from additional optimization in Spark since they are bottlenecked
on input read.
Storage: Information for RDDs that are persisted
The storage page contains information about persisted RDDs. An RDD is persisted if
someone called persist() on the RDD and it was later computed in some job. In
some cases, if many RDDs are cached, older ones will fall out of memory to make
space for newer ones. This page will tell you exactly what fraction of each RDD is
cached and the quantity of data cached in various storage media (disk, memory, etc.).
It can be helpful to scan this page and understand whether important datasets are fitting into memory or not.
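For example, here is a minimal sketch of how an RDD ends up on this page (the input path and filter are illustrative, and sc is an existing SparkContext):

import org.apache.spark.storage.StorageLevel

// Illustrative example: mark an RDD for persistence and force a computation.
val lines = sc.textFile("hdfs:///logs/app.log")   // path is hypothetical
val errors = lines.filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)      // spill to disk if memory is tight

// The RDD appears on the storage page only after an action computes it.
println(errors.count())

Until the count() action runs, the RDD is merely marked for persistence and will not yet be listed on the storage page.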
Executors: A list of executors present in the application
This page lists the active executors in the application along with some metrics around
the processing and storage on each executor. One valuable use of this page is to confirm that your application has the amount of resources you were expecting. A good first step when debugging issues is to scan this page, since a misconfiguration resulting in fewer executors than expected can, for obvious reasons, affect performance. It
can also be useful to look for executors with anomalous behaviors, such as a very
large ratio of failed to successful tasks. An executor with a high failure rate could
indicate a misconfiguration or failure on the physical host in question. Simply
removing that host from the cluster can improve performance.
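As an illustration, here is a hedged sketch of verifying executor resources programmatically at startup (the configuration values are assumptions, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.instances applies on YARN; other cluster managers
// size executors differently.
val conf = new SparkConf()
  .setAppName("ExecutorSanityCheck")
  .set("spark.executor.instances", "4")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)

// getExecutorMemoryStatus maps each executor's block-manager address to
// (maximum memory, remaining memory); its size roughly tracks the executor
// count shown on the executors page (the driver contributes one entry too).
println(s"Executors seen by the driver: ${sc.getExecutorMemoryStatus.size}")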
Another feature of the executors page is the ability to collect a stack trace from executors using the Thread Dump button (this feature was introduced in Spark 1.2). Visualizing the thread call stack of an executor can show exactly what code is executing at
an instant in time. If an executor is sampled several times in a short time period with this feature, you can identify “hot spots,” or expensive sections, in user code. This type of informal profiling can often detect such inefficiencies.
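Conceptually, the Thread Dump button corresponds to a JVM-level thread snapshot; a minimal sketch of the same idea using the standard JVM API (illustrative only, not Spark's actual implementation):

import scala.collection.JavaConverters._

// Snapshot every live thread's call stack in the current JVM, similar in
// spirit to what the Thread Dump button reports for a single executor.
val dump = Thread.getAllStackTraces.asScala
for ((thread, frames) <- dump) {
  println(s"--- ${thread.getName} (${thread.getState}) ---")
  frames.take(5).foreach(frame => println(s"    at $frame"))
}

Repeating such snapshots and looking for frames that recur across samples is the essence of the “hot spot” identification described above.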
Environment: Debugging Spark's configuration
This page enumerates the set of active properties in the environment of your Spark
application. The configuration here represents the “ground truth” of your application's configuration. It can be helpful if you are debugging which configuration flags
are enabled, especially if you are using multiple configuration mechanisms. This page
will also enumerate JARs and files you've added to your application, which can be
useful when you're tracking down issues such as missing dependencies.
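To cross-check this page from code, here is a small sketch (assuming an existing SparkContext named sc) that prints every property the driver actually resolved:

// Dump the effective configuration, mirroring the environment page's
// "ground truth" view of which settings won out.
sc.getConf.getAll.sorted.foreach { case (key, value) =>
  println(s"$key = $value")
}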