Database Reference
In-Depth Information
Because MongoDB distributes data using
chunks
based on ranges of the
shard key
, the choice
of shard key can control how MongoDB distributes data and the resulting systems' capacity
for writes and queries.
Ideally, your shard key should have two characteristics:
▪ Insertions are
balanced
between shards
▪ Most queries can be
routed
to a subset of the shards to be satisfied
Here are some initially appealing options for shard keys, which on closer examination, fail to
meet at least one of these criteria:
Timestamps
Shard keys based on the timestamp or the insertion time (i.e., the
ObjectId
) end up all
going in the “high” chunk, and therefore to a single shard. The inserts are not
balanced
.
Hashes
If the shard key is random, as with a hash, then all queries must be
broadcast
to all shards.
The queries are not
routeable
.
We'll now examine these options in more detail.
Option 1: Shard by time
Although using the timestamp, or the
ObjectId
in the
_id
field, would distribute your data
evenly among shards, these keys lead to two problems:
▪ All inserts always flow to the same shard, which means that your shard cluster will have
the same write throughput as a standalone instance.
▪ Most reads will tend to cluster on the same shard, assuming you access recent data more
frequently.
Option 2: Shard by a semi-random key
To distribute data more evenly among the shards, you may consider using a more “random”
piece of data, such as a hash of the
_id
field (i.e., the
ObjectId
as a shard key).
While this introduces some additional complexity into your application, to generate the key, it
will distribute writes among the shards. In these deployments, having five shards will provide
five times the write capacity as a single instance.