DAG Construction
To understand how a job is broken up into stages, we need to look at the type of tasks that
can run in a stage. There are two types: shuffle map tasks and result tasks. The name of
the task type indicates what Spark does with the task's output:
Shuffle map tasks
As the name suggests, shuffle map tasks are like the map-side part of the shuffle in
MapReduce. Each shuffle map task runs a computation on one RDD partition and,
based on a partitioning function, writes its output to a new set of partitions, which are
then fetched in a later stage (which could be composed of either shuffle map tasks or
result tasks). Shuffle map tasks run in all stages except the final stage.
Result tasks
Result tasks run in the final stage that returns the result to the user's program (such as
the result of a count()). Each result task runs a computation on its RDD partition,
then sends the result back to the driver, and the driver assembles the results from each
partition into a final result (which may be Unit, in the case of actions like
saveAsTextFile()).
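To make both task types concrete, here is a minimal sketch (assuming an existing SparkContext sc and a placeholder input file) that runs a two-stage job. The tasks that produce the (word, 1) pairs are shuffle map tasks, since reduceByKey() repartitions their output by key; the tasks in the final stage are result tasks, and the explicit runJob() call shows the driver assembling their per-partition values, much as an action like count() does (this mimics, rather than reproduces, Spark's own count() implementation):

// Stage 1: shuffle map tasks. Each task maps one partition and
// partitions its (word, 1) pairs by key for the shuffle that
// reduceByKey() requires.
val counts = sc.textFile("words.txt") // placeholder path
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

// Stage 2: result tasks. Each task computes one value for its
// partition and sends it back; the driver assembles the partial
// results into the final answer.
val perPartition: Array[Long] =
  sc.runJob(counts, (iter: Iterator[(String, Int)]) => iter.size.toLong)
val total: Long = perPartition.sum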
The simplest Spark job is one that does not need a shuffle and therefore has just a single
stage composed of result tasks. This is like a map-only job in MapReduce.
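For instance, the following sketch (outputPath is a placeholder, like the inputPath used below) writes an uppercased copy of its input; there is no wide dependency, so the whole job is one stage of result tasks, each of which reads, transforms, and writes a single partition:

// A single-stage, map-only job: no shuffle is needed, so every task
// is a result task. The action's result is Unit; the useful output
// is the set of files written to outputPath.
sc.textFile(inputPath)
  .map(_.toUpperCase)
  .saveAsTextFile(outputPath)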
More complex jobs involve grouping operations and require one or more shuffle stages.
For example, consider the following job for calculating a histogram of word counts for
text files stored in inputPath (one word per line):
val hist: Map[Int, Long] = sc.textFile(inputPath)
  .map(word => (word.toLowerCase(), 1))
  .reduceByKey((a, b) => a + b)
  .map(_.swap)
  .countByKey()
The first two transformations, map() and reduceByKey(), perform a word count. The
third transformation is a map() that swaps the key and value in each pair, to give (count,
word) pairs, and the final operation is the countByKey() action, which returns the
number of words with each count (i.e., a frequency distribution of word counts). For
example, if the input contains the words the, the, cat, the word count step produces
(the, 2) and (cat, 1), swapping gives (2, the) and (1, cat), and countByKey() returns
Map(2 -> 1, 1 -> 1): one word occurred twice and one word occurred once.
Spark's DAG scheduler turns this job into two stages since the reduceByKey() operation
forces a shuffle stage.[131] The resulting DAG is illustrated in Figure 19-2.
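One way to see this boundary for yourself is to build the same lineage and print it with toDebugString, which marks each shuffle dependency with an extra level of indentation (a sketch; the exact output format varies between Spark versions):

// The indented block in the printed lineage, below the ShuffledRDD
// line, corresponds to the first stage; the RDDs above it belong to
// the second stage.
val swapped = sc.textFile(inputPath)
  .map(word => (word.toLowerCase(), 1))
  .reduceByKey((a, b) => a + b)
  .map(_.swap)
println(swapped.toDebugString)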
The RDDs within each stage are also, in general, arranged in a DAG. The diagram shows
the type of the RDD and the operation that created it. RDD[String] was created by
textFile(), for instance.