is already materialized as a side effect of an earlier shuffle, even if it was not explicitly
persist()ed. This is an under-the-hood optimization that takes advantage of the fact
that Spark shuffle outputs are written to disk, and that portions of the RDD graph are
often recomputed.
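For instance, if an RDD downstream of a shuffle is recomputed for a second action, Spark can
reuse the shuffle files already sitting on disk instead of rerunning the map-side stage, which
then shows up as skipped in the UI. A minimal sketch of this behavior (the file path and
variable names here are illustrative, not part of the original example):
// Hypothetical log file; any line-oriented text file will do
scala> val lines = sc.textFile("/tmp/example-log.txt")
// reduceByKey introduces a shuffle boundary, splitting the plan into two stages
scala> val totals = lines.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _)
// The first action runs both stages and leaves the shuffle output on disk
scala> totals.collect()
// The second action can reuse that shuffle output, so the map-side stage is
// skipped even though totals was never persist()ed
scala> totals.collect()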
To see the effects of caching on physical execution, let's cache the counts RDD and
see how that truncates the execution graph for future actions (Example 8-10). If you
revisit the UI, you should see that caching reduces the number of stages required
when executing future computations. Calling collect() a few more times will reveal
only one stage executing to perform the action.
Example 8-10. Computing an already cached RDD
// Cache the RDD
scala> counts.cache()
// The first subsequent execution will again require 2 stages
scala> counts.collect()
res87: Array[(String, Int)] = Array((ERROR,1), (INFO,4), (WARN,2), (##,1),
((empty,2))
// This execution will only require a single stage
scala> counts.collect()
res88: Array[(String, Int)] = Array((ERROR,1), (INFO,4), (WARN,2), (##,1),
((empty,2))
The set of stages produced for a particular action is termed a job. Each time we invoke
an action such as count(), we create a job composed of one or more stages.
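One way to see this in the UI is to label actions before triggering them; each labeled action
then appears as its own job on the Jobs page. A quick sketch (the descriptions are arbitrary):
// setJobDescription only changes how the job is labeled in the web UI
scala> sc.setJobDescription("count the log levels")
scala> counts.count()     // one job, made up of one or more stages
scala> sc.setJobDescription("collect the log level totals")
scala> counts.collect()   // a separate job for this action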
Once the stage graph is defined, tasks are created and dispatched to an internal
scheduler, which varies depending on the deployment mode being used. Stages in the
physical plan can depend on each other, based on the RDD lineage, so they will be
executed in a specific order. For instance, a stage that outputs shuffle data must occur
before one that relies on that data being present.
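The lineage behind these dependencies can be inspected with toDebugString, which prints the
RDD graph and indents it at each shuffle boundary (the exact formatting varies by Spark
version):
scala> counts.toDebugString
// The printed lineage shows the shuffle-producing RDD at the top, with the
// map-side RDDs indented beneath it; that indentation marks the shuffle
// boundary between the two stages, so the indented stage must run first.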
A physical stage will launch tasks that each do the same thing but on specific partitions
of data. Each task internally performs the same steps (see the sketch after this list):
1. Fetching its input, either from data storage (if the RDD is an input RDD), an
existing RDD (if the stage is based on already cached data), or shuffle outputs.
2. Performing the operation necessary to compute the RDD(s) that it represents. For
instance, executing filter() or map() functions on the input data, or performing
grouping or reduction.
3. Writing output to a shuffle, to external storage, or back to the driver (if it is the
final RDD of an action such as count()).
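To make the per-partition nature of tasks concrete, the following sketch tags each record with
the index of the partition (and therefore the task) that processed it; the tagged variable
name is illustrative:
// mapPartitionsWithIndex exposes which partition, and thus which task,
// handled each group of records
scala> val tagged = counts.mapPartitionsWithIndex { (partitionIndex, records) =>
     |   records.map { case (word, total) => (partitionIndex, word, total) }
     | }
// collect() is the final step that ships each task's output back to the driver
scala> tagged.collect()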