Database Reference
In-Depth Information
MAPREDUCE IN SPARK
Despite the suggestive naming, Spark's
map()
and
reduce()
operations do not directly correspond to
the functions of the same name in Hadoop MapReduce. The general form of map and reduce in Hadoop
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Notice that both functions can return multiple output pairs, indicated by the
list
notation. This is im-
plemented by the
flatMap()
operation in Spark (and Scala in general), which is like
map()
, but re-
moves a layer of nesting:
scala>
val l = List(1, 2, 3)
l: List[Int] = List(1, 2, 3)
scala>
l.map(a => List(a))
res0: List[List[Int]] = List(List(1), List(2), List(3))
scala>
l.flatMap(a => List(a))
res1: List[Int] = List(1, 2, 3)
One naive way to try to emulate Hadoop MapReduce in Spark is with two
flatMap()
operations, sep-
arated by a
groupByKey()
and a
sortByKey()
to perform a MapReduce shuffle and sort:
val
input
:
RDD
[(
K1
,
V1
)]
=
...
val
mapOutput
:
RDD
[(
K2
,
V2
)]
=
input
.
flatMap
(
mapFn
)
val
shuffled
:
RDD
[(
K2
,
Iterable
[
V2
])]
=
mapOutput
.
groupByKey
().
sortByKey
()
val
output
:
RDD
[(
K3
,
V3
)]
=
shuffled
.
flatMap
(
reduceFn
)
Here the key type
K2
needs to inherit from Scala's
Ordering
type to satisfy
sortByKey()
.
This example may be useful as a way to help understand the relationship between MapReduce and
Spark, but it should not be applied blindly. For one thing, the semantics are slightly different from Ha-
doop's MapReduce, since
sortByKey()
performs a total sort. This issue can be avoided by using
re-
partitionAndSortWithinPartitions()
to perform a partial sort. However, even this isn't as
efficient, since Spark uses two shuffles (one for the
groupByKey()
and one for the sort).
Rather than trying to reproduce MapReduce, it is better to use only the operations that you actually need.
For example, if you don't need keys to be sorted, you can omit the
sortByKey()
call (something that
is not possible in regular Hadoop MapReduce).
Similarly,
groupByKey()
is too general in most cases. Usually you only need the shuffle to aggregate
values, so you should use
reduceByKey()
,
foldByKey()
, or
aggregateByKey()
(covered in
the next section), which are more efficient than
groupByKey()
since they can also run as combiners
in the map task. Finally,
flatMap()
may not always be needed either, with
map()
being preferred if
there is always one return value, and
filter()
if there is zero or one.