Database Reference
In-Depth Information
Example 8-8. Visualizing RDDs with toDebugString() in Scala
scala
>
input
.
toDebugString
res85
:
String
=
(
2
)
input
.
text
MappedRDD
[
292
]
at
textFile
at
<
console
>:
13
|
input
.
text
HadoopRDD
[
291
]
at
textFile
at
<
console
>:
13
scala
>
counts
.
toDebugString
res84
:
String
=
(
2
)
ShuffledRDD
[
296
]
at
reduceByKey
at
<
console
>:
17
+-(
2
)
MappedRDD
[
295
]
at
map
at
<
console
>:
17
|
FilteredRDD
[
294
]
at
filter
at
<
console
>:
15
|
MappedRDD
[
293
]
at
map
at
<
console
>:
15
|
input
.
text
MappedRDD
[
292
]
at
textFile
at
<
console
>:
13
|
input
.
text
HadoopRDD
[
291
]
at
textFile
at
<
console
>:
13
The first visualization shows the
input
RDD. We created this RDD by calling
sc.textFile()
. The lineage gives us some clues as to what
sc.textFile()
does since
it reveals which RDDs were created in the
textFile()
function. We can see that it
creates a
HadoopRDD
and then performs a map on it to create the returned RDD. The
lineage of
counts
is more complicated. That RDD has several ancestors, since there
are other operations that were performed on top of the
input
RDD, such as addi‐
tional maps, filtering, and reduction. The lineage of
counts
shown here is also dis‐
played graphically on the left side of
Figure 8-1
.
Before we perform an action, these RDDs simply store metadata that will help us
compute them later. To trigger computation, let's call an action on the
counts
RDD
and
collect()
it to the driver, as shown in
Example 8-9
.
Example 8-9. Collecting an RDD
scala
>
counts
.
collect
()
res86
:
Array
[(
String
,
Int
)]
=
Array
((
ERROR
,
1
),
(
INFO
,
4
),
(
WARN
,
2
))
Spark's scheduler creates a physical execution plan to compute the RDDs needed for
performing the action. Here when we call
collect()
on the RDD, every partition of
the RDD must be materialized and then transferred to the driver program. Spark's
scheduler starts at the final RDD being computed (in this case,
counts
) and works
backward to find what it must compute. It visits that RDD's parents, its parents'
parents, and so on, recursively to develop a physical plan necessary to compute all
ancestor RDDs. In the simplest case, the scheduler outputs a computation
stage
for
each RDD in this graph where the stage has
tasks
for each partition in that RDD.
Those stages are then executed in reverse order to compute the final required RDD.
In more complex cases, the physical set of stages will not be an exact 1:1 correspond‐
ence to the RDD graph. This can occur when the scheduler performs
pipelining
, or
collapsing of multiple RDDs into a single stage. Pipelining occurs when RDDs can be