Database Reference
In-Depth Information
MAPREDUCE IN SPARK
Despite the suggestive naming, Spark's map() and reduce() operations do not directly correspond to
the functions of the same name in Hadoop MapReduce. The general form of map and reduce in Hadoop
MapReduce is (from Chapter 8 ) :
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Notice that both functions can return multiple output pairs, indicated by the list notation. This is im-
plemented by the flatMap() operation in Spark (and Scala in general), which is like map() , but re-
moves a layer of nesting:
scala> val l = List(1, 2, 3)
l: List[Int] = List(1, 2, 3)
scala> l.map(a => List(a))
res0: List[List[Int]] = List(List(1), List(2), List(3))
scala> l.flatMap(a => List(a))
res1: List[Int] = List(1, 2, 3)
One naive way to try to emulate Hadoop MapReduce in Spark is with two flatMap() operations, sep-
arated by a groupByKey() and a sortByKey() to perform a MapReduce shuffle and sort:
val input : RDD [( K1 , V1 )] = ...
val mapOutput : RDD [( K2 , V2 )] = input . flatMap ( mapFn )
val shuffled : RDD [( K2 , Iterable [ V2 ])] = mapOutput . groupByKey (). sortByKey ()
val output : RDD [( K3 , V3 )] = shuffled . flatMap ( reduceFn )
Here the key type K2 needs to inherit from Scala's Ordering type to satisfy sortByKey() .
This example may be useful as a way to help understand the relationship between MapReduce and
Spark, but it should not be applied blindly. For one thing, the semantics are slightly different from Ha-
doop's MapReduce, since sortByKey() performs a total sort. This issue can be avoided by using re-
partitionAndSortWithinPartitions() to perform a partial sort. However, even this isn't as
efficient, since Spark uses two shuffles (one for the groupByKey() and one for the sort).
Rather than trying to reproduce MapReduce, it is better to use only the operations that you actually need.
For example, if you don't need keys to be sorted, you can omit the sortByKey() call (something that
is not possible in regular Hadoop MapReduce).
Similarly, groupByKey() is too general in most cases. Usually you only need the shuffle to aggregate
values, so you should use reduceByKey() , foldByKey() , or aggregateByKey() (covered in
the next section), which are more efficient than groupByKey() since they can also run as combiners
in the map task. Finally, flatMap() may not always be needed either, with map() being preferred if
there is always one return value, and filter() if there is zero or one.
Search WWH ::




Custom Search