is already materialized as a side effect of an earlier shuffle, even if it was not explicitly
persist()ed. This is an under-the-hood optimization that takes advantage of the fact
that Spark shuffle outputs are written to disk, and that portions of the RDD graph are
often recomputed.
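For instance, if an RDD downstream of a shuffle is recomputed for a second action, Spark can
reuse the shuffle files already sitting on disk instead of rerunning the map-side stage, which
then shows up as skipped in the UI. A minimal sketch of this behavior (the file path and
variable names here are illustrative, not part of the original example):
// Hypothetical log file; any line-oriented text file will do
scala> val lines = sc.textFile("/tmp/example-log.txt")
// reduceByKey introduces a shuffle boundary, splitting the plan into two stages
scala> val totals = lines.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _)
// The first action runs both stages and leaves the shuffle output on disk
scala> totals.collect()
// The second action can reuse that shuffle output, so the map-side stage is
// skipped even though totals was never persist()ed
scala> totals.collect()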
To see the effects of caching on physical execution, let's cache the counts RDD and
see how that truncates the execution graph for future actions (Example 8-10). If you
revisit the UI, you should see that caching reduces the number of stages required
when executing future computations. Calling collect() a few more times will reveal
only one stage executing to perform the action.
Example 8-10. Computing an already cached RDD
// Cache the RDD
scala> counts.cache()
// The first subsequent execution will again require 2 stages
scala> counts.collect()
res87: Array[(String, Int)] = Array((ERROR,1), (INFO,4), (WARN,2), (##,1),
((empty,2))
// This execution will only require a single stage
scala> counts.collect()
res88: Array[(String, Int)] = Array((ERROR,1), (INFO,4), (WARN,2), (##,1),
((empty,2))
The set of stages produced for a particular action is termed a job. Each time we invoke
an action such as count(), we create a job composed of one or more stages.
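One way to see this in the UI is to label actions before triggering them; each labeled action
then appears as its own job on the Jobs page. A quick sketch (the descriptions are arbitrary):
// setJobDescription only changes how the job is labeled in the web UI
scala> sc.setJobDescription("count the log levels")
scala> counts.count()     // one job, made up of one or more stages
scala> sc.setJobDescription("collect the log level totals")
scala> counts.collect()   // a separate job for this action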
Once the stage graph is defined, tasks are created and dispatched to an internal
scheduler, which varies depending on the deployment mode being used. Stages in the
physical plan can depend on each other, based on the RDD lineage, so they will be
executed in a specific order. For instance, a stage that outputs shuffle data must occur
before one that relies on that data being present.
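The lineage behind these dependencies can be inspected with toDebugString, which prints the
RDD graph and indents it at each shuffle boundary (the exact formatting varies by Spark
version):
scala> counts.toDebugString
// The printed lineage shows the shuffle-producing RDD at the top, with the
// map-side RDDs indented beneath it; that indentation marks the shuffle
// boundary between the two stages, so the indented stage must run first.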
A physical stage will launch tasks that each do the same thing but on specific partitions
of data. Each task internally performs the same steps (see the sketch after this list):
1. Fetching its input, either from data storage (if the RDD is an input RDD), an
existing RDD (if the stage is based on already cached data), or shuffle outputs.
2. Performing the operation necessary to compute the RDD(s) that it represents. For
instance, executing filter() or map() functions on the input data, or performing
grouping or reduction.
3. Writing output to a shuffle, to external storage, or back to the driver (if it is the
final RDD of an action such as count()).
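To make the per-partition nature of tasks concrete, the following sketch tags each record with
the index of the partition (and therefore the task) that processed it; the tagged variable
name is illustrative:
// mapPartitionsWithIndex exposes which partition, and thus which task,
// handled each group of records
scala> val tagged = counts.mapPartitionsWithIndex { (partitionIndex, records) =>
     |   records.map { case (word, total) => (partitionIndex, word, total) }
     | }
// collect() is the final step that ships each task's output back to the driver
scala> tagged.collect()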