textFile(), for instance. To simplify the diagram, some intermediate RDDs generated
internally by Spark have been omitted. For example, the RDD returned by textFile()
is actually a MappedRDD[String] whose parent is a
HadoopRDD[LongWritable, Text].
Notice that the reduceByKey() transformation spans two stages; this is because it is
implemented using a shuffle, and the reduce function runs as a combiner on the map side
(stage 1) and as a reducer on the reduce side (stage 2) — just like in MapReduce. Also
like MapReduce, Spark's shuffle implementation writes its output to partitioned files on
local disk (even for in-memory RDDs), and the files are fetched by the RDD in the next
stage. [ 132 ]
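
The two-stage behavior described above can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Spark's actual implementation: the names map_side, reduce_side, reduce_by_key, and partition_for are hypothetical, in-memory lists stand in for the partitioned shuffle files on local disk, and the CRC-based partitioner is only an assumed stand-in for a hash partitioner.

```python
import zlib


def partition_for(key, num_reducers):
    # Hypothetical deterministic partitioner: picks the reduce partition for a key.
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def map_side(partition, func, num_reducers):
    # Stage 1: run the reduce function as a combiner within one input partition.
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    # Write the combined pairs into one bucket per reduce partition.
    # (These lists stand in for the partitioned files written to local disk.)
    buckets = [[] for _ in range(num_reducers)]
    for key, value in combined.items():
        buckets[partition_for(key, num_reducers)].append((key, value))
    return buckets


def reduce_side(reducer_id, map_outputs, func):
    # Stage 2: fetch this reducer's bucket from every map output and merge by key.
    merged = {}
    for buckets in map_outputs:
        for key, value in buckets[reducer_id]:
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


def reduce_by_key(partitions, func, num_reducers=2):
    # The shuffle boundary sits between these two steps.
    map_outputs = [map_side(p, func, num_reducers) for p in partitions]
    return [reduce_side(r, map_outputs, func) for r in range(num_reducers)]


if __name__ == "__main__":
    input_partitions = [
        [("a", 1), ("b", 1), ("a", 1)],
        [("b", 1), ("c", 1), ("a", 1)],
    ]
    result = reduce_by_key(input_partitions, lambda x, y: x + y)
    print(result)
```

Because the same function is applied on both sides of the shuffle, it should be associative and commutative; the map-side step then only reduces how many pairs cross the shuffle, without changing the final totals.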