Listing 3.6 Error generated without an index on “timestamp”
> db.tweets.find().sort({ "timestamp" : -1 })
error: {
    "$err" : "too much data for sort() with no index. add an index or specify a smaller limit",
    "code" : 10128
}
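The error message itself names two remedies: add an index, or specify a smaller limit. A minimal sketch of both follows, assuming a mongo shell session where `db` is the shell's database handle. The wrapper function `newestTweets` is a hypothetical name introduced only for illustration, and `createIndex` is the current shell helper (older shells used `ensureIndex` for the same purpose).

```javascript
// Hypothetical helper: return the n most recent tweets.
// `db` is assumed to be the mongo shell's database handle.
function newestTweets(db, n) {
  // Remedy 1: build a descending index on "timestamp" so sort() can
  // read documents in index order instead of sorting them in memory.
  db.tweets.createIndex({ timestamp: -1 });

  // Remedy 2: bound the result set with limit().
  return db.tweets.find().sort({ timestamp: -1 }).limit(n);
}
```

In the shell this could be called as `newestTweets(db, 10)`. Building the index once is enough; later calls to `createIndex` with the same specification leave the existing index in place.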
[Figure: MapReduce data flow through the stages Input, Split, Map, Shuffle, Reduce, and Output, illustrated by counting user mentions in sample tweets. Final output shown: @bob, 1; @jim, 2; @mike, 1; @sam, 3.]
Fig. 3.3 MapReduce framework. The steps in white are implemented by the reader. MongoDB
takes the documents from the database and runs each one through the map function. It then sorts
the emitted keys and runs each key and its values through the reduce function. The output from the
reduce function is stored in another collection
3.11 Grouping Documents: Identifying the Most Mentioned Users
With some simple use of the find and count functions, you can learn volumes about
the data you have collected. However, when it comes to aggregating data, you will
need to employ another set of functions, collectively called MapReduce.
MapReduce consists of two steps: “Map” and “Reduce”. In the map step, data
is extracted, filtered, and processed to be sent to the reduce function. The mapper
processes in the map step emit a series of key/value pairs. These key/value pairs are
sorted, and the values associated with each unique key are sent to a reduce process.
Each reduce process then computes a value for the key it is handed. A diagram of
this process is shown in Fig. 3.3. The MongoDB code for this example is shown in
Listing 3.7.
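To make the map, sort, and reduce steps concrete, the following is a minimal in-memory sketch in plain JavaScript. It is not the book's Listing 3.7 and does not call MongoDB; it only simulates the data flow. The sample tweets, the `text` field, and the function names `map`, `shuffle`, `reduce`, and `countMentions` are all assumptions made for illustration.

```javascript
// Illustrative sample input (not taken from a real dataset).
const tweets = [
  { id: "T1", text: "@jim @sam" },
  { id: "T2", text: "@bob @sam" },
  { id: "T3", text: "@sam @mike" },
];

// Map: emit one (mention, 1) pair for every @mention in a tweet.
function map(tweet) {
  const mentions = tweet.text.match(/@\w+/g) || [];
  return mentions.map((user) => [user, 1]);
}

// Shuffle: group the emitted values by key and order the keys.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  const sorted = [...groups.entries()].sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0
  );
  return new Map(sorted);
}

// Reduce: compute a single value (here, a sum) for one key.
function reduce(key, values) {
  return values.reduce((total, v) => total + v, 0);
}

// Run all three steps and collect one count per mentioned user.
function countMentions(docs) {
  const pairs = docs.flatMap((doc) => map(doc));
  const result = {};
  for (const [key, values] of shuffle(pairs)) {
    result[key] = reduce(key, values);
  }
  return result;
}

console.log(countMentions(tweets));
```

With this sample input, @sam is mentioned in all three tweets and so receives a count of 3, while each of the other users receives a count of 1. In MongoDB the grouping step is performed by the database itself, so only the map and reduce functions would need to be supplied.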