Most logging and instrumentation in Spark is expressed in terms of stages, tasks, and shuffles. Understanding how user code compiles down into the bits of physical execution is an advanced concept, but one that will help you immensely in tuning and debugging applications.
To summarize, the following phases occur during Spark execution:
User code defines a DAG (directed acyclic graph) of RDDs
Operations on RDDs create new RDDs that refer back to their parents, thereby
creating a graph.
Actions force translation of the DAG to an execution plan
When you call an action on an RDD, it must be computed. This requires computing its parent RDDs as well. Spark's scheduler submits a job to compute all needed RDDs. That job will have one or more stages, which are parallel waves of computation composed of tasks. Each stage will correspond to one or more RDDs in the DAG. A single stage can correspond to multiple RDDs due to pipelining.
Tasks are scheduled and executed on a cluster
Stages are processed in order, with individual tasks launching to compute segments of the RDD. Once the final stage is finished in a job, the action is complete.
In a given Spark application, this entire sequence of steps may occur many times in a
continuous fashion as new RDDs are created.
Finding Information
Spark records detailed progress information and performance metrics as applications execute. These are presented to the user in two places: the Spark web UI and the logfiles produced by the driver and executor processes.
Spark Web UI
The first stop for learning about the behavior and performance of a Spark application is Spark's built-in web UI. This is available on the machine where the driver is running, at port 4040 by default. One caveat is that in YARN cluster mode, where the application driver runs inside the cluster, you should access the UI through the YARN ResourceManager, which proxies requests directly to the driver.
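If another application is already serving on port 4040, the UI can be moved to a different port with the `spark.ui.port` property; the value below is illustrative, set for example in `spark-defaults.conf`:

```
# spark-defaults.conf: serve the driver's web UI on a non-default port
spark.ui.port    4050
```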
The Spark UI contains several different pages, and the exact format may differ across
Spark versions. As of Spark 1.2, the UI is composed of four different sections, which
we'll cover next.