last_run = cutoff
As long as we save and restore the last_run variable between aggregations, we can run these
aggregations as often as we like, since each aggregation operation is incremental (i.e., it uses
the 'reduce' output mode).
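For example, last_run can be persisted in a small metadata collection between invocations.
Here is a minimal sketch; the stats.meta collection and its document layout are hypothetical,
not part of the original schema:

from datetime import datetime, timedelta

def load_last_run(db):
    # Read the timestamp saved by the previous aggregation run,
    # falling back to the epoch on the very first run.
    doc = db['stats.meta'].find_one({'_id': 'last_run'})
    return doc['value'] if doc else datetime(1970, 1, 1)

def save_last_run(db, when):
    # Persist the cutoff so the next run aggregates only newer events.
    db['stats.meta'].replace_one(
        {'_id': 'last_run'},
        {'_id': 'last_run', 'value': when},
        upsert=True)

A run would then bracket the aggregations like this (the 60-second lag is illustrative):

last_run = load_last_run(db)
cutoff = datetime.utcnow() - timedelta(seconds=60)
# ... run the hourly, daily, weekly, monthly, and yearly aggregations ...
save_last_run(db, cutoff)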
Sharding Concerns
When sharding, we need to ensure that we don't choose the incoming timestamp as the shard
key, but rather something that varies significantly in the most recent documents. In the
previous example, we might consider using the userid as the most significant part of the shard
key.
To prevent a single, active user from creating a large chunk that MongoDB cannot split, we'll
use a compound shard key of userid and timestamp on the events collection as follows:
>>> db.command('shardcollection', 'dbname.events', {
...     'key': {'userid': 1, 'ts': 1}})
{ "collectionsharded": "dbname.events", "ok": 1 }
To shard the aggregated collections, we must use the _id field as the shard key, since
mapreduce keys its output documents on _id, so we'll issue the following group of shard
operations in the shell:
>>> db.command('shardcollection', 'dbname.stats.daily', {
...     'key': {'_id': 1}})
{ "collectionsharded": "dbname.stats.daily", "ok": 1 }
>>> db.command('shardcollection', 'dbname.stats.weekly', {
...     'key': {'_id': 1}})
{ "collectionsharded": "dbname.stats.weekly", "ok": 1 }
>>> db.command('shardcollection', 'dbname.stats.monthly', {
...     'key': {'_id': 1}})
{ "collectionsharded": "dbname.stats.monthly", "ok": 1 }
>>> db.command('shardcollection', 'dbname.stats.yearly', {
...     'key': {'_id': 1}})
{ "collectionsharded": "dbname.stats.yearly", "ok": 1 }
We also need to update the h_aggregate MapReduce wrapper to support sharded output by
adding 'sharded':True to the out argument. Our new h_aggregate now looks like this:
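A minimal sketch of the updated wrapper follows, assuming h_aggregate wraps pymongo's
map_reduce with the same mapf/reducef functions and a timestamp-bounded query as in the
earlier incremental example (the ts field name and query shape are carried over from that
example):

def h_aggregate(icollection, ocollection, mapf, cutoff, last_run):
    # Only aggregate documents that arrived since the last run.
    query = {'ts': {'$gt': last_run, '$lt': cutoff}}
    icollection.map_reduce(
        map=mapf,
        reduce=reducef,                   # reduce function from the earlier example
        out={'reduce': ocollection.name,  # incremental 'reduce' output mode
             'sharded': True},            # new: write to a sharded output collection
        query=query)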