DAG Construction
To understand how a job is broken up into stages, we need to look at the type of tasks that
can run in a stage. There are two types: shuffle map tasks and result tasks. The name of
the task type indicates what Spark does with the task's output:
Shuffle map tasks
As the name suggests, shuffle map tasks are like the map-side part of the shuffle in
MapReduce. Each shuffle map task runs a computation on one RDD partition and,
based on a partitioning function, writes its output to a new set of partitions, which are
then fetched in a later stage (which could be composed of either shuffle map tasks or
result tasks). Shuffle map tasks run in all stages except the final stage.
Result tasks
Result tasks run in the final stage that returns the result to the user's program (such as
the result of a count()). Each result task runs a computation on its RDD partition,
then sends the result back to the driver, and the driver assembles the results from each
partition into a final result (which may be Unit, in the case of actions like
saveAsTextFile()).
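To make both task types concrete, here is a minimal sketch (assuming an existing SparkContext sc and a placeholder input file) that runs a two-stage job. The tasks that produce the (word, 1) pairs are shuffle map tasks, since reduceByKey() repartitions their output by key; the tasks in the final stage are result tasks, and the explicit runJob() call shows the driver assembling their per-partition values, much as an action like count() does (this mimics, rather than reproduces, Spark's own count() implementation):

// Stage 1: shuffle map tasks. Each task maps one partition and
// partitions its (word, 1) pairs by key for the shuffle that
// reduceByKey() requires.
val counts = sc.textFile("words.txt") // placeholder path
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

// Stage 2: result tasks. Each task computes one value for its
// partition and sends it back; the driver assembles the partial
// results into the final answer.
val perPartition: Array[Long] =
  sc.runJob(counts, (iter: Iterator[(String, Int)]) => iter.size.toLong)
val total: Long = perPartition.sum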
The simplest Spark job is one that does not need a shuffle and therefore has just a single
stage composed of result tasks. This is like a map-only job in MapReduce.
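For instance, the following sketch (outputPath is a placeholder, like the inputPath used below) writes an uppercased copy of its input; there is no wide dependency, so the whole job is one stage of result tasks, each of which reads, transforms, and writes a single partition:

// A single-stage, map-only job: no shuffle is needed, so every task
// is a result task. The action's result is Unit; the useful output
// is the set of files written to outputPath.
sc.textFile(inputPath)
  .map(_.toUpperCase)
  .saveAsTextFile(outputPath)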
More complex jobs involve grouping operations and require one or more shuffle stages.
For example, consider the following job for calculating a histogram of word counts for
text files stored in inputPath (one word per line):
val hist: Map[Int, Long] = sc.textFile(inputPath)
  .map(word => (word.toLowerCase(), 1))
  .reduceByKey((a, b) => a + b)
  .map(_.swap)
  .countByKey()
The first two transformations, map() and reduceByKey(), perform a word count. The
third transformation is a map() that swaps the key and value in each pair, to give (count,
word) pairs, and the final operation is the countByKey() action, which returns the
number of words with each count (i.e., a frequency distribution of word counts). For
example, if the input contains the words the, the, cat, the word count step produces
(the, 2) and (cat, 1), swapping gives (2, the) and (1, cat), and countByKey() returns
Map(2 -> 1, 1 -> 1): one word occurred twice and one word occurred once.
Spark's DAG scheduler turns this job into two stages since the reduceByKey() operation
forces a shuffle stage.[131] The resulting DAG is illustrated in Figure 19-2.
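One way to see this boundary for yourself is to build the same lineage and print it with toDebugString, which marks each shuffle dependency with an extra level of indentation (a sketch; the exact output format varies between Spark versions):

// The indented block in the printed lineage, below the ShuffledRDD
// line, corresponds to the first stage; the RDDs above it belong to
// the second stage.
val swapped = sc.textFile(inputPath)
  .map(word => (word.toLowerCase(), 1))
  .reduceByKey((a, b) => a + b)
  .map(_.swap)
println(swapped.toDebugString)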
The RDDs within each stage are also, in general, arranged in a DAG. The diagram shows
the type of the RDD and the operation that created it. RDD[String] was created by
textFile(), for instance.