Storing Twitter Data - Twitter Data Analytics - page 30

Database Reference

In-Depth Information

Listing 3.6

Error generated without an index on “timestamp”

> db.tweets.find().sort({ "timestamp" : -1})

error: {

"$err" : "too much data for sort() with no index. add an

index or specify a smaller limit" ,

"code" : 10128

}

Input

Split

Map

Shuffle

Reduce

Output

@jim

@sam

@bob, 1

@bob, 1

T1: @jim @sam

@jim, 1

@jim, 1

@jim, 2

@bob

T1: @jim @sam

T2: @bob @sam

@jim

T3: @sam @mike

T2: @bob @sam

@jim

@bob, 1

@mike, 1

@jim, 2

@sam, 3

@jim

@sam

@mike, 1

@mike, 1

@mike

@sam

@sam, 1

@sam, 1

@sam, 1

T3: @sam, @mike

@sam, 3

Fig. 3.3 MapReduce framework. The steps in white are implemented by the reader. MongoDB

takes the documents from the database and runs each one through the map function. It then sorts

the emitted keys and runs each key and its values through the reduce function. The output from the

reduce function is stored in another collection

3.11

Grouping Documents: Identifying the Most

Mentioned Users

With some simple use of the find and count functions, you can learn volumes about

the data you have collected. However, when it comes to aggregating data, we will

need to employ another set of functions, collectively called MapReduce.

MapReduce consists of two steps: “Map”, and “Reduce”. In the map step, data

is extracted, filtered, and processed to be sent to the reduce function. The mapper

processes in the map step emit a series of key/value pairs. These key value pairs are

sorted, and the values associated with each unique key are sent to a reduce process.

Each reduce process then computes a value for the key it is handed. A diagram for

this process is shown in Fig. 3.3 . The MongoDB code for this example is shown in

Listing 3.7 .

Next Page

Twitter Data Analytics

Search WWH ::

Custom Search

Home