Spark performs several optimizations, such as “pipelining” map transformations together to merge them, and converts the execution graph into a set of stages. Each stage, in turn, consists of multiple tasks. The tasks are bundled up and prepared to be sent to the cluster. Tasks are the smallest unit of work in Spark; a typical user program can launch hundreds or thousands of individual tasks.
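You can observe these stage boundaries yourself by inspecting an RDD's lineage with toDebugString(). As a minimal sketch (assuming the SparkContext sc that the pyspark shell provides), the chained maps and filter below are pipelined into a single stage, while groupByKey() forces a shuffle and therefore starts a new one:

    # Narrow transformations: Spark pipelines all three into one stage.
    pipelined = (sc.parallelize(range(1000))
                   .map(lambda x: x * 2)
                   .map(lambda x: x + 1)
                   .filter(lambda x: x > 10))

    # groupByKey() requires a shuffle, so a new stage begins here.
    shuffled = pipelined.map(lambda x: (x % 10, x)).groupByKey()

    # toDebugString() prints the lineage; indentation marks stage boundaries.
    print(shuffled.toDebugString().decode())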
Scheduling tasks on executors
Given a physical execution plan, a Spark driver must coordinate the scheduling of individual tasks on executors. When executors are started, they register themselves with the driver, so it has a complete view of the application's executors at all times. Each executor represents a process capable of running tasks and storing RDD data.
The Spark driver will look at the current set of executors and try to schedule each task in an appropriate location, based on data placement. When tasks execute, they may have a side effect of storing cached data. The driver also tracks the location of cached data and uses it to schedule future tasks that access that data.
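As a small sketch of this cache-aware scheduling (again assuming the shell's sc; the HDFS path is hypothetical), the first action below materializes and caches the RDD, and the driver then prefers the executors holding those cached partitions when scheduling tasks for the second action:

    lines = sc.textFile("hdfs:///logs/app.log")   # hypothetical path
    errors = lines.filter(lambda line: "ERROR" in line)
    errors.cache()          # mark for in-memory storage on executors

    print(errors.count())   # first action: computes and caches partitions
    print(errors.take(5))   # later action: scheduled near the cached data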
The driver exposes information about the running Spark application through a web interface, which by default is available at port 4040. For instance, in local mode, this UI is available at http://localhost:4040. We'll cover Spark's web UI and its scheduling mechanisms in more detail in Chapter 8.
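If port 4040 is already taken, Spark binds the UI to the next free port (4041, 4042, and so on). You can also pin it explicitly with the spark.ui.port configuration property, as in this minimal sketch (the application name and port value are arbitrary examples):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("local[2]")          # example master; two local threads
            .setAppName("UIPortDemo")       # arbitrary example name
            .set("spark.ui.port", "4050"))  # serve the web UI on port 4050
    sc = SparkContext(conf=conf)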
Executors
Spark executors are worker processes responsible for running the individual tasks in a given Spark job. Executors are launched once at the beginning of a Spark application and typically run for the entire lifetime of an application, though Spark applications can continue if executors fail. Executors have two roles. First, they run the tasks that make up the application and return results to the driver. Second, they provide in-memory storage for RDDs that are cached by user programs, through a service called the Block Manager that lives within each executor. Because RDDs are cached directly inside of executors, tasks can run alongside the cached data.
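To illustrate the Block Manager's storage role, the sketch below (assuming the shell's sc) persists an RDD with an explicit StorageLevel, which controls whether its partitions are kept in executor memory, spilled to executor-local disk, or both:

    from pyspark import StorageLevel

    data = sc.parallelize(range(1000000))
    # Keep partitions in executor memory, spilling to disk if memory fills.
    data.persist(StorageLevel.MEMORY_AND_DISK)
    data.count()  # materializes the RDD into each executor's Block Manager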
Drivers and Executors in Local Mode
For most of this book, you've run examples in Spark's local mode. In this mode, the Spark driver runs along with an executor in the same Java process. This is a special case; executors typically each run in a dedicated process.
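A minimal sketch of starting an application in this mode; the master string "local[2]" asks for two worker threads inside the single driver process:

    from pyspark import SparkConf, SparkContext

    # Driver and executor share one Java process in local mode.
    conf = SparkConf().setMaster("local[2]").setAppName("LocalModeDemo")
    sc = SparkContext(conf=conf)

    print(sc.parallelize([1, 2, 3, 4]).sum())  # runs entirely in-process
    sc.stop()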
Cluster Manager
So far we've discussed drivers and executors in somewhat abstract terms. But how do drivers and executor processes initially get launched? Spark depends on a cluster manager to launch executors and, in certain cases, to launch the driver.