Spark - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

MAPREDUCE IN SPARK

Despite the suggestive naming, Spark's map() and reduce() operations do not directly correspond to

the functions of the same name in Hadoop MapReduce. The general form of map and reduce in Hadoop

MapReduce is (from Chapter 8 ) :

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

Notice that both functions can return multiple output pairs, indicated by the list notation. This is im-

plemented by the flatMap() operation in Spark (and Scala in general), which is like map() , but re-

moves a layer of nesting:

scala> val l = List(1, 2, 3)

l: List[Int] = List(1, 2, 3)

scala> l.map(a => List(a))

res0: List[List[Int]] = List(List(1), List(2), List(3))

scala> l.flatMap(a => List(a))

res1: List[Int] = List(1, 2, 3)

One naive way to try to emulate Hadoop MapReduce in Spark is with two flatMap() operations, sep-

arated by a groupByKey() and a sortByKey() to perform a MapReduce shuffle and sort:

val input : RDD [( K1 , V1 )] = ...

val mapOutput : RDD [( K2 , V2 )] = input . flatMap ( mapFn )

val shuffled : RDD [( K2 , Iterable [ V2 ])] = mapOutput . groupByKey (). sortByKey ()

val output : RDD [( K3 , V3 )] = shuffled . flatMap ( reduceFn )

Here the key type K2 needs to inherit from Scala's Ordering type to satisfy sortByKey() .

This example may be useful as a way to help understand the relationship between MapReduce and

Spark, but it should not be applied blindly. For one thing, the semantics are slightly different from Ha-

doop's MapReduce, since sortByKey() performs a total sort. This issue can be avoided by using re-

partitionAndSortWithinPartitions() to perform a partial sort. However, even this isn't as

efficient, since Spark uses two shuffles (one for the groupByKey() and one for the sort).

Rather than trying to reproduce MapReduce, it is better to use only the operations that you actually need.

For example, if you don't need keys to be sorted, you can omit the sortByKey() call (something that

is not possible in regular Hadoop MapReduce).

Similarly, groupByKey() is too general in most cases. Usually you only need the shuffle to aggregate

values, so you should use reduceByKey() , foldByKey() , or aggregateByKey() (covered in

the next section), which are more efficient than groupByKey() since they can also run as combiners

in the map task. Finally, flatMap() may not always be needed either, with map() being preferred if

there is always one return value, and filter() if there is zero or one.

Search WWH ::

Custom Search

Home