Anatomy of a Spark Job Run
Let's walk through what happens when we run a Spark job. At the highest level, there are two independent entities: the driver, which hosts the application (SparkContext) and schedules tasks for a job; and the executors, which are exclusive to the application, run for the duration of the application, and execute the application's tasks. Usually the driver runs as a client that is not managed by the cluster manager and the executors run on machines in the cluster, but this isn't always the case (as we'll see in Executors and Cluster Managers). For the remainder of this section, we assume that the application's executors are already running.
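To make the division of labor concrete, here is a minimal sketch of a standalone driver program (the object name and dataset are illustrative, not taken from this chapter; it assumes the application is launched with spark-submit, which supplies the master URL). The SparkContext is created in the driver process, while the actual filtering and counting are carried out by tasks running on the executors:

    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleDriver {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("Simple Driver")
        val sc = new SparkContext(conf)        // lives in the driver process
        try {
          val nums = sc.parallelize(1 to 1000) // transformation: lazy, no tasks yet
          val evens = nums.filter(_ % 2 == 0)  // still lazy
          println(evens.count())               // action: tasks run on the executors
        } finally {
          sc.stop()
        }
      }
    }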
Job Submission
Figure 19-1 illustrates how Spark runs a job. A Spark job is submitted automatically when an action (such as count()) is performed on an RDD. Internally, this causes runJob() to be called on the SparkContext (step 1 in Figure 19-1), which passes the call on to the scheduler that runs as a part of the driver (step 2). The scheduler is made up of two parts: a DAG scheduler that breaks down the job into a DAG of stages, and a task scheduler that is responsible for submitting the tasks from each stage to the cluster.
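As a rough illustration (assuming a spark-shell session, where sc is already provided; the data is made up), the count() below submits a single job. Because reduceByKey() introduces a shuffle, the DAG scheduler breaks that job into two stages, a shuffle map stage followed by a result stage:

    val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val counts = pairs.reduceByKey(_ + _)   // transformation only: nothing runs yet

    // The action calls runJob() on the SparkContext behind the scenes; the DAG
    // scheduler splits the job into stages at the shuffle boundary, and the
    // task scheduler submits each stage's tasks to the executors.
    counts.count()

A narrow transformation such as filter() would stay within a single stage; it is the shuffle required by reduceByKey() that forces the stage split.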