The driver communicates with a potentially large number of distributed workers called executors. The driver runs in its own Java process, and each executor is a separate Java process. A driver and its executors are together termed a Spark application.
Figure 7-1. The components of a distributed Spark application
A Spark application is launched on a set of machines using an external service called a cluster manager. As noted, Spark is packaged with a built-in cluster manager called the Standalone cluster manager. Spark also works with Hadoop YARN and Apache Mesos, two popular open source cluster managers.
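For illustration, here is a minimal sketch (not from the book; host names and ports are hypothetical placeholders) of how the choice of cluster manager surfaces in user code: the master URL handed to SparkConf determines which manager runs the application.

    import org.apache.spark.{SparkConf, SparkContext}

    // The master URL selects the cluster manager; hosts and ports are hypothetical.
    val conf = new SparkConf().setAppName("MyApp")
    conf.setMaster("spark://master-host:7077")    // Standalone cluster manager
    // conf.setMaster("yarn")                     // Hadoop YARN ("yarn-client"/"yarn-cluster" on older releases)
    // conf.setMaster("mesos://mesos-host:5050")  // Apache Mesos
    val sc = new SparkContext(conf)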
The Driver
The driver is the process where the main() method of your program runs. It is the process running the user code that creates a SparkContext, creates RDDs, and performs transformations and actions. When you launch a Spark shell, you've created a driver program (if you remember, the Spark shell comes preloaded with a SparkContext called sc). Once the driver terminates, the application is finished.
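To make this concrete, here is a minimal sketch of a self-contained driver program (the object name and HDFS paths are hypothetical). Its main() method creates the SparkContext that the shell would otherwise preload as sc:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      // The driver is the JVM process running this main() method.
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount")
        val sc = new SparkContext(conf)   // what the shell preloads as sc

        // User code running in the driver: create an RDD, transform it, act on it.
        val lines = sc.textFile("hdfs://namenode/path/input.txt")   // hypothetical path
        val counts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://namenode/path/output")        // hypothetical path

        sc.stop()   // once the driver terminates, the application is finished
      }
    }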
When the driver runs, it performs two duties:
Converting a user program into tasks
The Spark driver is responsible for converting a user program into units of physical execution called tasks. At a high level, all Spark programs follow the same structure: they create RDDs from some input, derive new RDDs from those using transformations, and perform actions to collect or save data. A Spark program implicitly creates a logical directed acyclic graph (DAG) of operations. When the driver runs, it converts this logical graph into a physical execution plan.
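As a hedged sketch of that structure (assuming a shell-style sc; the input path is hypothetical), the logical DAG can be inspected with RDD.toDebugString before any action forces physical execution:

    // Transformations only record lineage; nothing executes yet.
    val lines = sc.textFile("input.txt")            // hypothetical input
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // Print the logical graph (lineage) Spark has recorded so far.
    println(counts.toDebugString)

    // Only an action makes the driver turn this DAG into a physical
    // plan of stages and tasks and run it on the executors.
    counts.collect().foreach(println)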