Tuning and Debugging Spark - Learning Spark

Example 8-8. Visualizing RDDs with toDebugString() in Scala

scala> input.toDebugString
res85: String =
(2) input.text MappedRDD[292] at textFile at <console>:13
 |  input.text HadoopRDD[291] at textFile at <console>:13

scala> counts.toDebugString
res84: String =
(2) ShuffledRDD[296] at reduceByKey at <console>:17
 +-(2) MappedRDD[295] at map at <console>:17
    |  FilteredRDD[294] at filter at <console>:15
    |  MappedRDD[293] at map at <console>:15
    |  input.text MappedRDD[292] at textFile at <console>:13
    |  input.text HadoopRDD[291] at textFile at <console>:13

The first visualization shows the input RDD. We created this RDD by calling sc.textFile(). The lineage gives us some clues as to what sc.textFile() does since it reveals which RDDs were created in the textFile() function. We can see that it creates a HadoopRDD and then performs a map on it to create the returned RDD. The lineage of counts is more complicated. That RDD has several ancestors, since there are other operations that were performed on top of the input RDD, such as additional maps, filtering, and reduction. The lineage of counts shown here is also displayed graphically on the left side of Figure 8-1.
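To make the lineage easier to relate to code, here is a minimal sketch of a pipeline that would produce a chain of this shape (textFile, map, filter, map, reduceByKey). The pipeline definition is not shown in this section, so the object name LineageSketch, the helper buildCounts, the file name input.txt, and the local[2] master are all assumptions made for illustration. The RDD class names and IDs printed by toDebugString vary between Spark versions, so the output will not match Example 8-8 exactly.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageSketch {
  // Builds per-level counts. `path` is a hypothetical log file whose lines
  // are assumed to start with a level such as INFO, WARN, or ERROR.
  def buildCounts(sc: SparkContext, path: String): RDD[(String, Int)] = {
    val input = sc.textFile(path)                // HadoopRDD plus a map
    val tokenized = input
      .map(line => line.split(" "))              // map
      .filter(words => words.headOption.exists(_.nonEmpty)) // filter: skip blank lines
    tokenized
      .map(words => (words(0), 1))               // map
      .reduceByKey(_ + _)                        // reduction (introduces a shuffle)
  }

  def main(args: Array[String]): Unit = {
    // local[2] and the file name are assumptions for local experimentation.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("LineageSketch"))
    try {
      val counts = buildCounts(sc, "input.txt")
      // No job runs here; this only prints the recorded lineage.
      println(counts.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

Each transformation in buildCounts adds one ancestor to the lineage of the returned RDD, which is why counts has a longer lineage than input.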

Before we perform an action, these RDDs simply store metadata that will help us compute them later. To trigger computation, let's call an action on the counts RDD and collect() it to the driver, as shown in Example 8-9.

Example 8-9. Collecting an RDD

scala> counts.collect()
res86: Array[(String, Int)] = Array((ERROR,1), (INFO,4), (WARN,2))

Spark's scheduler creates a physical execution plan to compute the RDDs needed for performing the action. Here, when we call collect() on the RDD, every partition of the RDD must be materialized and then transferred to the driver program. Spark's scheduler starts at the final RDD being computed (in this case, counts) and works backward to find what it must compute. It visits that RDD's parents, its parents' parents, and so on, recursively, to develop the physical plan necessary to compute all ancestor RDDs. In the simplest case, the scheduler outputs a computation stage for each RDD in this graph, where the stage has tasks for each partition in that RDD. Those stages are then executed in reverse order to compute the final required RDD.
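The simplest case described above can be sketched with a toy model in plain Scala. This is not Spark's scheduler: ToyRdd, ToyStage, ToyScheduler, and ToySchedulerDemo are hypothetical names, and the model deliberately ignores pipelining. It only illustrates the backward walk from the final RDD through its ancestors, one stage per RDD with one task per partition, and execution in the reverse of the planning order.

```scala
// Toy model only: ToyRdd, ToyStage, and ToyScheduler are hypothetical names,
// not Spark classes, and pipelining is deliberately ignored.
final case class ToyRdd(name: String, partitions: Int, parents: List[ToyRdd] = Nil)
final case class ToyStage(rdd: String, tasks: Int)

object ToyScheduler {
  // Walk backward from the final RDD through its parents, recursively.
  // The result lists the final RDD first and its earliest ancestors last.
  def planBackward(last: ToyRdd): List[ToyStage] = {
    def visit(rdd: ToyRdd, planned: List[ToyRdd]): List[ToyRdd] =
      if (planned.exists(_.name == rdd.name)) planned // already planned
      else rdd :: rdd.parents.foldLeft(planned)((acc, parent) => visit(parent, acc))
    // Simplest case: one stage per RDD, one task per partition.
    visit(last, Nil).map(rdd => ToyStage(rdd.name, rdd.partitions))
  }

  // Stages run in the reverse of the planning order, so parents come first.
  def executionOrder(last: ToyRdd): List[ToyStage] = planBackward(last).reverse
}

object ToySchedulerDemo {
  // The lineage of counts from Example 8-8, every RDD with 2 partitions.
  val hadoop   = ToyRdd("HadoopRDD[291]", 2)
  val input    = ToyRdd("MappedRDD[292]", 2, List(hadoop))
  val split    = ToyRdd("MappedRDD[293]", 2, List(input))
  val filtered = ToyRdd("FilteredRDD[294]", 2, List(split))
  val keyed    = ToyRdd("MappedRDD[295]", 2, List(filtered))
  val counts   = ToyRdd("ShuffledRDD[296]", 2, List(keyed))

  def main(args: Array[String]): Unit =
    ToyScheduler.executionOrder(counts).foreach { stage =>
      println(s"${stage.rdd}: ${stage.tasks} tasks")
    }
}
```

Running the demo lists HadoopRDD[291] first and ShuffledRDD[296] last, each with 2 tasks. In this toy model that gives six stages for the counts lineage, which is the 1:1 correspondence that holds only in the simplest case.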

In more complex cases, the physical set of stages will not be an exact 1:1 correspondence to the RDD graph. This can occur when the scheduler performs pipelining, or collapsing of multiple RDDs into a single stage. Pipelining occurs when RDDs can be
