Database Reference
In-Depth Information
5.4.4
Map-reduce
You may be wondering why MongoDB would support both
group
and
map-reduce
,
since they provide such similar functionality. In fact,
group
preceded
map-reduce
as
MongoDB's sole aggregator.
map-reduce
was added later on for a couple of related
reasons. First, operations in the map-reduce style were becoming more mainstream,
and it seemed prudent to integrate this budding way of thinking into the product.
11
The second reason was much more practical: iterating over large data sets, especially
in a sharded configuration, required a distributed aggregator. Map-reduce (the para-
digm) provided just that.
map-reduce
includes many options. Here they are in all their byzantine detail:
map
—A JavaScript function to be applied to each document. This function must
call
emit()
to select the keys and values to be aggregated. Within the function
context, the value of
this
is a reference to the current document. So, for exam-
ple, if you wanted to group your results by user
ID
and produce totals on a vote
count and document count, then your map function would look like this:
function() {
emit(this.user_id, {vote_sum: this.vote_count, doc_count: 1});
}
reduce
—A JavaScript function that receives a key and a list of values. This func-
tion must always return a value having the same structure as each of the values
provided in the values array. A
reduce
function typically iterates over the list of
values and aggregates them in the process. Sticking to our example, here's how
you'd reduce the mapped values:
function(key, values) {
var vote_sum = 0;
var doc_sum = 0;
values.forEach(function(value) {
vote_sum += value.vote_sum;
doc_sum += value.doc_sum;
});
return {vote_sum: vote_sum, doc_sum: doc_sum};
}
Note that the value of the key parameter frequently isn't used in the aggrega-
tion itself.
query
—A query selector that filters the collection to be mapped. This parame-
ter serves the same function as
group
's
cond
parameter.
11
A lot of developers first saw map-reduce in a famous paper by Google on distributed computations (http://
labs.google.com/papers/mapreduce.html). The ideas in this paper helped form the basis for Hadoop, an
open source framework that uses distributed map-reduce to process large data sets. The map-reduce idea then
spread. CouchDB, for instance, employed a map-reduce paradigm for declaring indexes.