Tuning and Debugging Spark - Learning Spark

computed from their parents without data movement. The lineage output shown in

Example 8-8 uses indentation levels to show where RDDs are going to be pipelined

together into physical stages. RDDs that exist at the same level of indentation as their

parents will be pipelined during physical execution. For instance, when we are

computing counts, even though there are a large number of parent RDDs, there are only

two levels of indentation shown. This indicates that the physical execution will

require only two stages. Pipelining occurs in this case because several filter

and map operations run in sequence. The right half of Figure 8-1 shows the two stages of

execution that are required to compute the counts RDD.

Figure 8-1. RDD transformations pipelined into physical stages
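
The grouping of RDDs into stages described above can be sketched with a small toy model. This is plain Python, not Spark's API: the ToyRDD class, the split_into_stages function, and the sample lineage are all hypothetical names invented for illustration, and the model assumes a simple linear chain in which each RDD has at most one parent. The idea it tries to capture is that an RDD computed from its parent without data movement joins its parent's stage, while an RDD that needs a shuffle starts a new one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not a Spark class).

    shuffle=True means computing this RDD from its parent requires
    moving data between partitions.
    """
    name: str
    parent: Optional["ToyRDD"] = None
    shuffle: bool = False


def split_into_stages(rdd: ToyRDD) -> List[List[str]]:
    """Group a linear lineage into stages, root first.

    RDDs that need no shuffle are pipelined into their parent's stage;
    an RDD that needs a shuffle begins a new stage.
    """
    # Collect the chain from the final RDD back to the root, then reverse it.
    chain = []
    node = rdd
    while node is not None:
        chain.append(node)
        node = node.parent
    chain.reverse()

    stages: List[List[str]] = [[]]
    for node in chain:
        if node.shuffle:
            stages.append([])  # a shuffle boundary starts a new stage
        stages[-1].append(node.name)
    return stages


# Hypothetical lineage: several map and filter steps, then one shuffle.
input_rdd = ToyRDD("textFile")
trimmed = ToyRDD("map", parent=input_rdd)
non_empty = ToyRDD("filter", parent=trimmed)
pairs = ToyRDD("map (to pairs)", parent=non_empty)
counts = ToyRDD("reduceByKey", parent=pairs, shuffle=True)

stages = split_into_stages(counts)
for i, names in enumerate(stages):
    print(f"stage {i}: {' -> '.join(names)}")
```

In this sketch the four RDDs before the shuffle collapse into a single stage and the final RDD forms a second one, which mirrors the two levels of indentation in the lineage output. In a real Spark program, the lineage itself can be inspected by calling toDebugString on the RDD.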

If you visit the application's web UI, you will see that two stages occur in order to

fulfill the collect() action. The Spark UI can be found at http://localhost:4040 if you

are running this example on your own machine. The UI is discussed in more detail

later in this chapter, but you can use it here to quickly see what stages are executing

during this program.

In addition to pipelining, Spark's internal scheduler may truncate the lineage of the

RDD graph if an existing RDD has already been persisted in cluster memory or on

disk. Spark can “short-circuit” in this case and just begin computing based on the

persisted RDD. A second case in which this truncation can happen is when an RDD
