Spark - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

NOTE

Rather than using Hadoop RPC for remote calls, Spark uses Akka , an actor-based platform for building

highly scalable, event-driven distributed applications.

Executors also send status update messages to the driver when a task has finished or if a

task fails. In the latter case, the task scheduler will resubmit the task on another executor.

It will also launch speculative tasks for tasks that are running slowly, if this is enabled (it

is not by default).

Task Execution

An executor runs a task as follows (step 7). First, it makes sure that the JAR and file de-

pendencies for the task are up to date. The executor keeps a local cache of all the depend-

encies that previous tasks have used, so that it only downloads them when they have

changed. Second, it deserializes the task code (which includes the user's functions) from

the serialized bytes that were sent as a part of the launch task message. Third, the task

code is executed. Note that tasks are run in the same JVM as the executor, so there is no

process overhead for task launch. [ 134 ]

Tasks can return a result to the driver. The result is serialized and sent to the executor

backend, and then back to the driver as a status update message. A shuffle map task re-

turns information that allows the next stage to retrieve the output partitions, while a result

task returns the value of the result for the partition it ran on, which the driver assembles

into a final result to return to the user's program.

Search WWH ::

Custom Search

Home