Spark - Hadoop: The Definitive Guide

textFile(), for instance. To simplify the diagram, some intermediate RDDs generated internally by Spark have been omitted. For example, the RDD returned by textFile() is actually a MappedRDD[String] whose parent is a HadoopRDD[LongWritable, Text].

Notice that the reduceByKey() transformation spans two stages; this is because it is implemented using a shuffle, and the reduce function runs as a combiner on the map side (stage 1) and as a reducer on the reduce side (stage 2), just like in MapReduce. Also like MapReduce, Spark's shuffle implementation writes its output to partitioned files on local disk (even for in-memory RDDs), and the files are fetched by the RDD in the next stage. [132]
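
The two-stage behavior of reduceByKey() can be illustrated with a small simulation in plain Python. This is only a sketch of the idea, not Spark's implementation: the function names (map_side_combine, shuffle_write, reduce_side, reduce_by_key) are hypothetical, in-memory lists stand in for the partitioned files on local disk, and hash partitioning is assumed.

```python
def map_side_combine(partition, func):
    """Stage 1: apply the reduce function as a combiner within one input partition."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


def shuffle_write(combined, num_reducers):
    """Bucket combined records by target reduce partition.

    The buckets are an assumed stand-in for the partitioned shuffle files.
    """
    buckets = [[] for _ in range(num_reducers)]
    for key, value in combined.items():
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets


def reduce_side(fetched, func):
    """Stage 2: apply the same function as a reducer over the fetched partial results."""
    result = {}
    for key, value in fetched:
        result[key] = func(result[key], value) if key in result else value
    return result


def reduce_by_key(partitions, func, num_reducers=2):
    # Stage 1: each map task combines locally, then writes one bucket per reducer.
    map_outputs = [
        shuffle_write(map_side_combine(p, func), num_reducers) for p in partitions
    ]
    # Stage 2: each reduce task fetches its bucket from every map output and merges.
    merged = {}
    for r in range(num_reducers):
        fetched = [record for output in map_outputs for record in output[r]]
        merged.update(reduce_side(fetched, func))
    return merged


# Two input partitions of (key, value) pairs, reduced by summing values per key.
partitions = [[("a", 1), ("b", 2), ("a", 3)], [("b", 4), ("c", 5)]]
totals = reduce_by_key(partitions, lambda x, y: x + y)
```

Because the function runs on both sides of the shuffle, it has to be associative and commutative for the combined result to match a single-pass reduction; addition and max both qualify.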
