Example 8-8. Visualizing RDDs with toDebugString() in Scala
scala> input.toDebugString
res85: String =
(2) input.text MappedRDD[292] at textFile at <console>:13
 |  input.text HadoopRDD[291] at textFile at <console>:13

scala> counts.toDebugString
res84: String =
(2) ShuffledRDD[296] at reduceByKey at <console>:17
 +-(2) MappedRDD[295] at map at <console>:17
    |  FilteredRDD[294] at filter at <console>:15
    |  MappedRDD[293] at map at <console>:15
    |  input.text MappedRDD[292] at textFile at <console>:13
    |  input.text HadoopRDD[291] at textFile at <console>:13
The first visualization shows the input RDD. We created this RDD by calling
sc.textFile() . The lineage gives us some clues as to what sc.textFile() does since
it reveals which RDDs were created in the textFile() function. We can see that it
creates a HadoopRDD and then performs a map on it to create the returned RDD. The
lineage of counts is more complicated. That RDD has several ancestors, since there
are other operations that were performed on top of the input RDD, such as additional
maps, filtering, and reduction. The lineage of counts shown here is also displayed
graphically on the left side of Figure 8-1.
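For reference, a lineage like the one above could be produced by a chain of transformations along the following lines. This is a minimal sketch: the file name, the splitting and filtering logic, and the variable names are assumptions chosen to match the operations visible in the debug string, not necessarily the exact code that produced the output shown.

val input = sc.textFile("input.txt")          // HadoopRDD, then a MappedRDD from textFile()
val tokenized = input
  .map(line => line.split(" "))               // MappedRDD
  .filter(words => words.nonEmpty)            // FilteredRDD
val counts = tokenized
  .map(words => (words(0), 1))                // MappedRDD
  .reduceByKey((a, b) => a + b)               // ShuffledRDD

Each transformation adds one RDD to the chain, which is why toDebugString on counts lists every ancestor back to the HadoopRDD created by textFile().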
Before we perform an action, these RDDs simply store metadata that will help us
compute them later. To trigger computation, let's call an action on the counts RDD
and collect() it to the driver, as shown in Example 8-9 .
Example 8-9. Collecting an RDD
scala> counts.collect()
res86: Array[(String, Int)] = Array((ERROR,1), (INFO,4), (WARN,2))
Spark's scheduler creates a physical execution plan to compute the RDDs needed for
performing the action. Here when we call collect() on the RDD, every partition of
the RDD must be materialized and then transferred to the driver program. Spark's
scheduler starts at the final RDD being computed (in this case, counts ) and works
backward to find what it must compute. It visits that RDD's parents, its parents'
parents, and so on, recursively to develop a physical plan necessary to compute all
ancestor RDDs. In the simplest case, the scheduler outputs a computation stage for
each RDD in this graph where the stage has tasks for each partition in that RDD.
Those stages are then executed in reverse order to compute the final required RDD.
In more complex cases, the physical set of stages will not be an exact 1:1 correspondence
to the RDD graph. This can occur when the scheduler performs pipelining, or
collapsing of multiple RDDs into a single stage. Pipelining occurs when RDDs can be
computed from their parents without data movement.
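As an illustration, the indentation markers in toDebugString hint at these stage boundaries: narrow transformations such as map and filter are pipelined into the same stage, while an operation like reduceByKey requires a shuffle and therefore starts a new stage. The snippet below is a sketch under assumed names and input; it is not the example from earlier in the chapter.

// Narrow transformations: the scheduler can pipeline these into one stage,
// because each output partition depends only on a single input partition.
val lines  = sc.textFile("input.txt")
val errors = lines.map(_.toLowerCase).filter(_.contains("error"))

// reduceByKey introduces a shuffle, so a new stage begins here.
val errorCounts = errors.map(line => (line, 1)).reduceByKey(_ + _)

// The "+-( )" marker in the output indicates the shuffle (stage) boundary.
println(errorCounts.toDebugString)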