Spark performs several optimizations, such as “pipelining” map transformations together to merge them, and converts the execution graph into a set of stages. Each stage, in turn, consists of multiple tasks. The tasks are bundled up and prepared to be sent to the cluster. Tasks are the smallest unit of work in Spark; a typical user program can launch hundreds or thousands of individual tasks.
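You can observe these stage boundaries yourself by inspecting an RDD's lineage with toDebugString(). As a minimal sketch (assuming the SparkContext sc that the pyspark shell provides), the chained maps and filter below are pipelined into a single stage, while groupByKey() forces a shuffle and therefore starts a new one:

    # Narrow transformations: Spark pipelines all three into one stage.
    pipelined = (sc.parallelize(range(1000))
                   .map(lambda x: x * 2)
                   .map(lambda x: x + 1)
                   .filter(lambda x: x > 10))

    # groupByKey() requires a shuffle, so a new stage begins here.
    shuffled = pipelined.map(lambda x: (x % 10, x)).groupByKey()

    # toDebugString() prints the lineage; indentation marks stage boundaries.
    print(shuffled.toDebugString().decode())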
Scheduling tasks on executors
Given a physical execution plan, a Spark driver must coordinate the scheduling of individual tasks on executors. When executors are started, they register themselves with the driver, so it has a complete view of the application's executors at all times. Each executor represents a process capable of running tasks and storing RDD data.
The Spark driver will look at the current set of executors and try to schedule each task in an appropriate location, based on data placement. When tasks execute, they may have a side effect of storing cached data. The driver also tracks the location of cached data and uses it to schedule future tasks that access that data.
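As a small sketch of this cache-aware scheduling (again assuming the shell's sc; the HDFS path is hypothetical), the first action below materializes and caches the RDD, and the driver then prefers the executors holding those cached partitions when scheduling tasks for the second action:

    lines = sc.textFile("hdfs:///logs/app.log")   # hypothetical path
    errors = lines.filter(lambda line: "ERROR" in line)
    errors.cache()          # mark for in-memory storage on executors

    print(errors.count())   # first action: computes and caches partitions
    print(errors.take(5))   # later action: scheduled near the cached data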
The driver exposes information about the running Spark application through a web interface, which by default is available at port 4040. For instance, in local mode, this UI is available at http://localhost:4040. We'll cover Spark's web UI and its scheduling mechanisms in more detail in Chapter 8.
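If port 4040 is already taken, Spark binds the UI to the next free port (4041, 4042, and so on). You can also pin it explicitly with the spark.ui.port configuration property, as in this minimal sketch (the application name and port value are arbitrary examples):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("local[2]")          # example master; two local threads
            .setAppName("UIPortDemo")       # arbitrary example name
            .set("spark.ui.port", "4050"))  # serve the web UI on port 4050
    sc = SparkContext(conf=conf)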
Executors
Spark executors are worker processes responsible for running the individual tasks in a given Spark job. Executors are launched once at the beginning of a Spark application and typically run for the entire lifetime of an application, though Spark applications can continue if executors fail. Executors have two roles. First, they run the tasks that make up the application and return results to the driver. Second, they provide in-memory storage for RDDs that are cached by user programs, through a service called the Block Manager that lives within each executor. Because RDDs are cached directly inside of executors, tasks can run alongside the cached data.
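To illustrate the Block Manager's storage role, the sketch below (assuming the shell's sc) persists an RDD with an explicit StorageLevel, which controls whether its partitions are kept in executor memory, spilled to executor-local disk, or both:

    from pyspark import StorageLevel

    data = sc.parallelize(range(1000000))
    # Keep partitions in executor memory, spilling to disk if memory fills.
    data.persist(StorageLevel.MEMORY_AND_DISK)
    data.count()  # materializes the RDD into each executor's Block Manager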
Drivers and Executors in Local Mode
For most of this book, you've run examples in Spark's local mode. In this mode, the Spark driver runs along with an executor in the same Java process. This is a special case; executors typically each run in a dedicated process.
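A minimal sketch of starting an application in this mode; the master string "local[2]" asks for two worker threads inside the single driver process:

    from pyspark import SparkConf, SparkContext

    # Driver and executor share one Java process in local mode.
    conf = SparkConf().setMaster("local[2]").setAppName("LocalModeDemo")
    sc = SparkContext(conf=conf)

    print(sc.parallelize([1, 2, 3, 4]).sum())  # runs entirely in-process
    sc.stop()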
Cluster Manager
So far we've discussed drivers and executors in somewhat abstract terms. But how do drivers and executor processes initially get launched? Spark depends on a cluster manager to launch executors and, in certain cases, to launch the driver.