textFile(), for instance. To simplify the diagram, some intermediate RDDs generated
internally by Spark have been omitted. For example, the RDD returned by textFile() is
actually a MappedRDD[String] whose parent is a HadoopRDD[LongWritable, Text].
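One way to see the intermediate RDDs that the diagram leaves out is to ask Spark for an RDD's lineage. The sketch below is a minimal illustration, not part of the original example: the object name, the application name, the local[1] master, and the temporary input file are all assumptions made for the sake of a runnable program. It relies on toDebugString and dependencies, both of which are methods on RDD. The class names printed depend on the Spark version; later releases report a MapPartitionsRDD where this text describes a MappedRDD.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    // Local-mode context; the app name and master URL are illustrative choices.
    val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical input: a small temporary file created only for this sketch.
      val path = Files.createTempFile("lineage", ".txt")
      Files.write(path, "a\nb\n".getBytes("UTF-8"))

      val lines = sc.textFile(path.toString)

      // toDebugString describes this RDD and its ancestors, one per line.
      println(lines.toDebugString)

      // dependencies exposes the immediate parent RDDs programmatically.
      val parents = lines.dependencies.map(_.rdd.getClass.getSimpleName)
      println(parents.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

The first println shows the whole chain at once, while dependencies is the more convenient route when the parents need to be inspected in code rather than read by eye.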
Notice that the reduceByKey() transformation spans two stages; this is because it is
implemented using a shuffle, and the reduce function runs as a combiner on the map side
(stage 1) and as a reducer on the reduce side (stage 2), just like in MapReduce. Also
like MapReduce, Spark's shuffle implementation writes its output to partitioned files on
local disk (even for in-memory RDDs), and the files are fetched by the RDD in the next