Anatomy of a Spark Job Run
Let's walk through what happens when we run a Spark job. At the highest level, there are two independent entities: the driver, which hosts the application (SparkContext) and schedules tasks for a job; and the executors, which are exclusive to the application, run for the duration of the application, and execute the application's tasks. Usually the driver runs as a client that is not managed by the cluster manager, and the executors run on machines in the cluster, but this isn't always the case (as we'll see in Executors and Cluster Managers). For the remainder of this section, we assume that the application's executors are already running.
Job Submission
Figure 19-1 illustrates how Spark runs a job. A Spark job is submitted automatically when an action (such as count()) is performed on an RDD. Internally, this causes runJob() to be called on the SparkContext (step 1 in Figure 19-1), which passes the call on to the scheduler that runs as a part of the driver (step 2). The scheduler is made up of two parts: a DAG scheduler that breaks down the job into a DAG of stages, and a task scheduler that is responsible for submitting the tasks from each stage to the cluster.
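The flow above — an action triggering runJob(), which hands the job to a scheduler that splits it into stages and dispatches tasks — can be sketched as a toy model. This is a conceptual illustration only, not Spark's actual implementation: the class names (ToyContext, ToyRDD, DAGScheduler, TaskScheduler) are hypothetical stand-ins that merely mirror the roles described in the text, and the "tasks" run inline rather than on remote executors.

```python
# Toy model of Spark's job-submission path. All names here are
# illustrative stand-ins for the components described in the text;
# this is not Spark code.

class Stage:
    """A stage applies one function to each partition of an RDD."""
    def __init__(self, func):
        self.func = func

class TaskScheduler:
    """Submits a stage's tasks for execution (here, runs them inline
    instead of shipping them to executors)."""
    def submit_tasks(self, stage, partitions):
        # One task per partition.
        return [stage.func(p) for p in partitions]

class DAGScheduler:
    """Breaks a job into stages; this toy job needs only one stage."""
    def __init__(self, task_scheduler):
        self.task_scheduler = task_scheduler

    def run_job(self, rdd, func):
        stage = Stage(func)
        return self.task_scheduler.submit_tasks(stage, rdd.partitions)

class ToyContext:
    """Plays the role of SparkContext on the driver."""
    def __init__(self):
        self.scheduler = DAGScheduler(TaskScheduler())

    def run_job(self, rdd, func):
        # Steps 1-2: the action calls runJob(), which passes the job
        # to the scheduler running as part of the driver.
        return self.scheduler.run_job(rdd, func)

class ToyRDD:
    def __init__(self, ctx, partitions):
        self.ctx = ctx
        self.partitions = partitions

    def count(self):
        # An action submits a job: count each partition, then sum.
        per_partition = self.ctx.run_job(self, len)
        return sum(per_partition)

ctx = ToyContext()
rdd = ToyRDD(ctx, partitions=[[1, 2], [3, 4, 5]])
print(rdd.count())  # 5
```

The key structural point the sketch preserves is the separation of concerns: the DAG scheduler decides *what* stages make up the job, while the task scheduler decides *where and how* each stage's per-partition tasks run.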