computed from their parents without data movement. The lineage output shown in
Example 8-8 uses indentation levels to show where RDDs are going to be pipelined
together into physical stages. RDDs that exist at the same level of indentation as their
parents will be pipelined during physical execution. For instance, when we are
computing counts, even though there are a large number of parent RDDs, there are only
two levels of indentation shown. This indicates that the physical execution will
require only two stages. Pipelining occurs in this case because several filter
and map operations run in sequence. The right half of Figure 8-1 shows the two stages of
execution that are required to compute the counts RDD.
Figure 8-1. RDD transformations pipelined into physical stages
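To make the relationship between lineage and stages concrete, the following is a minimal sketch in plain Python rather than Spark code. The Node class, the count_stages function, and the shape of the example lineage are all hypothetical names and assumptions chosen for illustration; Spark's real scheduler is considerably more involved. The sketch only captures the rule described above: RDDs linked to their parents without data movement share a stage, and each shuffle boundary starts a new one.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for an RDD in a lineage graph (not a Spark class)."""
    name: str
    # Each parent is paired with True when reaching it requires a shuffle
    # (data movement), or False when it can be pipelined with this node.
    parents: List[Tuple["Node", bool]] = field(default_factory=list)


def count_stages(final: Node) -> int:
    """Count stages in this toy model: one for the final node, plus one for
    each distinct node that sits on the parent side of a shuffle."""
    stage_roots = set()

    def visit(root: Node) -> None:
        if root.name in stage_roots:
            return  # this stage was already discovered via another path
        stage_roots.add(root.name)
        stack, seen = [root], set()
        while stack:
            node = stack.pop()
            if node.name in seen:
                continue
            seen.add(node.name)
            for parent, needs_shuffle in node.parents:
                if needs_shuffle:
                    visit(parent)         # shuffle boundary: a new stage begins
                else:
                    stack.append(parent)  # no data movement: same stage


    visit(final)
    return len(stage_roots)


# Assumed lineage for illustration: a chain of map and filter steps that is
# pipelined together, followed by a single shuffle that produces counts.
lines = Node("lines")
tokenized = Node("tokenized", [(lines, False)])    # map
filtered = Node("filtered", [(tokenized, False)])  # filter
pairs = Node("pairs", [(filtered, False)])         # map
counts = Node("counts", [(pairs, True)])           # shuffle boundary

stages = count_stages(counts)
print(stages)  # prints 2
```

In this toy lineage, counts has four ancestors but only one shuffle boundary, so the model reports two stages, which matches the two levels of indentation described in the text.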
If you visit the application's web UI, you will see that two stages occur in order to
fulfill the collect() action. The Spark UI can be found at http://localhost:4040 if you
are running this example on your own machine. The UI is discussed in more detail
later in this chapter, but you can use it here to quickly see what stages are executing
during this program.
In addition to pipelining, Spark's internal scheduler may truncate the lineage of the
RDD graph if an existing RDD has already been persisted in cluster memory or on
disk. Spark can “short-circuit” in this case and just begin computing based on the
persisted RDD. A second case in which this truncation can happen is when an RDD
 