last_run = cutoff
As long as we save and restore the last_run variable between aggregations, we can run these aggregations as often as we like, since each aggregation operation is incremental (i.e., using output mode 'reduce').
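The save/restore step can be sketched as a pair of small helpers that keep the timestamp in a dedicated state document between runs. The collection name and helper names here are assumptions for illustration, not from the text, and the code is duck-typed over any object with pymongo-3-style find_one()/replace_one() methods:

```python
from datetime import datetime

def load_last_run(state, default):
    # `state` is any collection-like object with find_one()/replace_one();
    # in production this would be a pymongo collection, e.g. a
    # (hypothetical) db.stats.state.
    doc = state.find_one({'_id': 'last_run'})
    return doc['value'] if doc is not None else default

def save_last_run(state, cutoff):
    # Upsert so the very first run creates the state document.
    state.replace_one(
        {'_id': 'last_run'},
        {'_id': 'last_run', 'value': cutoff},
        upsert=True)
```

Each aggregation run would call load_last_run before querying, then save_last_run with the cutoff once the incremental map-reduce completes, mirroring the last_run = cutoff assignment above.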
Sharding Concerns
When sharding, we need to ensure that we don't choose the incoming timestamp as a shard key, but rather something that varies significantly in the most recent documents. In the previous example, we might consider using the userid as the most significant part of the shard key.
To prevent a single, active user from creating a large chunk that MongoDB cannot split, we'll use a compound shard key with username-timestamp on the events collection as follows:

>>> db.command('shardcollection', 'dbname.events', {
...     'key': { 'userid': 1, 'ts': 1 } })
{ "collectionsharded": "dbname.events", "ok" : 1 }
To shard the aggregated collections, we must use the _id field to work well with mapreduce, so we'll issue the following group of shard operations in the shell:
>>> db.command('shardcollection', 'dbname.stats.daily', {
...     'key': { '_id': 1 } })
{ "collectionsharded": "dbname.stats.daily", "ok" : 1 }
>>> db.command('shardcollection', 'dbname.stats.weekly', {
...     'key': { '_id': 1 } })
{ "collectionsharded": "dbname.stats.weekly", "ok" : 1 }
>>> db.command('shardcollection', 'dbname.stats.monthly', {
...     'key': { '_id': 1 } })
{ "collectionsharded": "dbname.stats.monthly", "ok" : 1 }
>>> db.command('shardcollection', 'dbname.stats.yearly', {
...     'key': { '_id': 1 } })
{ "collectionsharded": "dbname.stats.yearly", "ok" : 1 }
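Since all four stats collections take the same shard key, the repeated commands can also be generated in a loop. This sketch only builds the command arguments; issuing them requires a live sharded cluster, so the actual db.command calls are left commented out:

```python
# Build the shardcollection arguments for each aggregate collection.
periods = ['daily', 'weekly', 'monthly', 'yearly']
commands = [
    ('shardcollection', 'dbname.stats.' + period, {'key': {'_id': 1}})
    for period in periods]

# Against a live cluster, one would then run (assuming a pymongo `db`
# handle, as in the examples above):
# for name, namespace, options in commands:
#     db.command(name, namespace, **options)
```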
We also need to update the h_aggregate MapReduce wrapper to support sharded output by adding 'sharded': True to the out argument. Our new h_aggregate now looks like this: