Anatomy of a Spark Job Run
Let's walk through what happens when we run a Spark job. At the highest level, there are two independent entities: the driver, which hosts the application (SparkContext) and schedules tasks for a job; and the executors, which are exclusive to the application, run for the duration of the application, and execute the application's tasks. Usually the driver runs as a client that is not managed by the cluster manager, and the executors run on machines in the cluster, but this isn't always the case (as we'll see in Executors and Cluster Managers). For the remainder of this section, we assume that the application's executors are already running.
Job Submission
Figure 19-1 illustrates how Spark runs a job. A Spark job is submitted automatically when an action (such as count()) is performed on an RDD. Internally, this causes runJob() to be called on the SparkContext (step 1 in Figure 19-1), which passes the call on to the scheduler that runs as a part of the driver (step 2). The scheduler is made up of two parts: a DAG scheduler that breaks down the job into a DAG of stages, and a task scheduler that is responsible for submitting the tasks from each stage to the cluster.
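The flow above — an action triggering runJob(), which hands the job to a scheduler that splits it into stages and dispatches tasks — can be sketched as a toy model. This is a conceptual illustration only, not Spark's actual implementation: the class names (ToyContext, ToyRDD, DAGScheduler, TaskScheduler) are hypothetical stand-ins that merely mirror the roles described in the text, and the "tasks" run inline rather than on remote executors.

```python
# Toy model of Spark's job-submission path. All names here are
# illustrative stand-ins for the components described in the text;
# this is not Spark code.

class Stage:
    """A stage applies one function to each partition of an RDD."""
    def __init__(self, func):
        self.func = func

class TaskScheduler:
    """Submits a stage's tasks for execution (here, runs them inline
    instead of shipping them to executors)."""
    def submit_tasks(self, stage, partitions):
        # One task per partition.
        return [stage.func(p) for p in partitions]

class DAGScheduler:
    """Breaks a job into stages; this toy job needs only one stage."""
    def __init__(self, task_scheduler):
        self.task_scheduler = task_scheduler

    def run_job(self, rdd, func):
        stage = Stage(func)
        return self.task_scheduler.submit_tasks(stage, rdd.partitions)

class ToyContext:
    """Plays the role of SparkContext on the driver."""
    def __init__(self):
        self.scheduler = DAGScheduler(TaskScheduler())

    def run_job(self, rdd, func):
        # Steps 1-2: the action calls runJob(), which passes the job
        # to the scheduler running as part of the driver.
        return self.scheduler.run_job(rdd, func)

class ToyRDD:
    def __init__(self, ctx, partitions):
        self.ctx = ctx
        self.partitions = partitions

    def count(self):
        # An action submits a job: count each partition, then sum.
        per_partition = self.ctx.run_job(self, len)
        return sum(per_partition)

ctx = ToyContext()
rdd = ToyRDD(ctx, partitions=[[1, 2], [3, 4, 5]])
print(rdd.count())  # 5
```

The key structural point the sketch preserves is the separation of concerns: the DAG scheduler decides *what* stages make up the job, while the task scheduler decides *where and how* each stage's per-partition tasks run.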