LACK OF LOCALITY
An ascending shard key has a clear direction; a completely random shard key has no
direction at all. The former fails to distribute inserts; the latter might distribute them
too well. This may have a counterintuitive ring to it, as the whole point of sharding is
to distribute reads and writes. But we can illustrate with a simple thought experiment.
Imagine that each of the documents in your sharded collection contains an MD5 and that the MD5 field is the shard key. Because the value of an MD5 will vary randomly across documents, this shard key will ensure that inserts distribute evenly across all shards in the cluster. This is desirable. But take a second to imagine how inserts into each shard's index on the MD5 field will work. Because the MD5 is totally random, each virtual memory page in the index is equally likely to be accessed on every insert. Practically speaking, this means that the index must always fit in RAM, and that if the indexes and data ever grow beyond the limits of physical RAM, page faults, and thus decreased performance, will be inevitable.
This is basically a problem of locality of reference. The idea of locality, at least here, is that data accessed within any given time interval is frequently related; this closeness can be exploited for optimization. For example, although object IDs make for poor shard keys, they do provide excellent locality because they're ascending. This means that successive inserts into the index will always occur within recently used virtual memory pages; so only a small portion of the index need be in RAM at any one time.
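The contrast can be made concrete with a toy model. The sketch below is not MongoDB's actual storage layout; it just treats the index as its final sorted keyspace cut into fixed-size pages and measures how many distinct pages each window of consecutive inserts touches — a rough proxy for the index's working set:

```python
import hashlib

PAGE_SIZE = 256  # keys per index page; an invented figure for illustration


def avg_working_set(keys, window=256, page_size=PAGE_SIZE):
    """Average number of distinct index pages touched per `window` of
    consecutive inserts, modeling the index as the final sorted keyspace
    cut into fixed-size pages."""
    ordered = sorted(keys)
    rank = {k: i for i, k in enumerate(ordered)}
    pages = [rank[k] // page_size for k in keys]
    windows = [pages[i:i + window]
               for i in range(0, len(pages) - window + 1, window)]
    return sum(len(set(w)) for w in windows) / len(windows)


n = 10_000
# Random shard key: an MD5 digest scatters each insert across the keyspace.
md5_keys = [hashlib.md5(str(i).encode()).hexdigest() for i in range(n)]
# Ascending shard key: each insert lands right next to the previous one.
ascending_keys = [f"{i:08d}" for i in range(n)]

print(avg_working_set(md5_keys))        # nearly every page, in every window
print(avg_working_set(ascending_keys))  # 1.0 — one hot page at a time
```

Under the random key, almost every page is hot in every window, so the whole index must stay resident; under the ascending key, only the current "right edge" page is hot.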
To take a less abstract example, imagine your application allows users to upload photos and that each photo's metadata is stored in a single document within a sharded collection. Now suppose that a user performs a batch upload of 100 photos. If the shard key is totally random, the database won't be able to take advantage of the locality here; inserts into the index will occur at 100 random locations. But let's say the shard key is the user's ID. In this case, each write to the index will occur at roughly the same location because every inserted document will have the same user ID value. Here you take advantage of locality, and you realize potentially significant gains in performance.
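A toy page model of the batch upload makes this visible. The page size, index size, and the user's position in the keyspace below are all invented for illustration; the point is only the count of distinct pages the 100 inserts land on:

```python
import random

PAGE_SIZE = 256     # keys per index page (illustrative)
EXISTING = 100_000  # keys already in the index (illustrative)


def pages_hit(insert_positions):
    """Distinct index pages a batch of inserts touches, given each insert's
    rank within the existing sorted keyspace."""
    return len({pos // PAGE_SIZE for pos in insert_positions})


random.seed(1)
# Random shard key: each of the 100 photos lands at an unrelated position.
md5_batch = [random.randrange(EXISTING) for _ in range(100)]
# User-ID shard key: all 100 photos share one key, hence one insertion point.
uid_batch = [57_123] * 100  # hypothetical rank of this user's ID

print(pages_hit(md5_batch))  # dozens of scattered pages
print(pages_hit(uid_batch))  # 1 — a single page
```

The batch with the random key dirties dozens of pages; the batch keyed on the user's ID touches exactly one.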
Random shard keys present one more problem: any meaningful range query on such a key will have to be sent to all shards. Think again about the sharded photos collection just described. If you want your app to be able to display a user's 10 most recently created photos (normally a trivial query), a random shard key will still require that this query be distributed to all shards. As you'll see in the next section, a coarser-grained shard key will permit a range query like this to take place on a single shard.
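The routing decision can be sketched as follows. This is a stand-in for what mongos does, not its actual chunk-routing logic: a query that pins the shard key to a single value can be targeted at one shard, while anything else must be scatter-gathered to all of them:

```python
import hashlib

NUM_SHARDS = 4  # illustrative cluster size


def shards_for_query(shard_key_field, query, num_shards=NUM_SHARDS):
    """Toy router: a query that constrains the shard key to one value is
    targeted at a single shard; any other query goes to every shard."""
    if shard_key_field in query:
        digest = hashlib.md5(str(query[shard_key_field]).encode()).digest()
        # Pretend chunk -> shard lookup.
        return [int.from_bytes(digest[:4], "big") % num_shards]
    return list(range(num_shards))


# "A user's 10 most recent photos": filters on user_id, sorts on a timestamp.
query = {"user_id": "u42"}

print(len(shards_for_query("md5", query)))      # 4 — random key: all shards
print(len(shards_for_query("user_id", query)))  # 1 — coarse key: one shard
```

With an MD5 shard key the query carries no shard-key predicate, so every shard must answer; with a user-ID shard key the same query resolves to a single shard.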
UNSPLITTABLE CHUNKS
If random and ascending shard keys don't work well, then the next obvious option is a coarse-grained shard key. A user ID is a good example of this. If you shard your photos collection on a user ID, then you can expect to see inserts distributed across shards, if only because it's impossible to tell which users will insert data when. Thus the coarse-grained shard key takes the element of randomness and uses that to the shard cluster's advantage.
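A quick simulation shows why this balances inserts. The shard-assignment function below is an invented deterministic stand-in for a chunk-range lookup, and the user population is synthetic; the point is that because user IDs arrive unpredictably, each user's batch of writes lands on an effectively random shard, so load evens out across the cluster:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # illustrative cluster size


def shard_for(user_id, num_shards=NUM_SHARDS):
    # Deterministic stand-in for the chunk-range lookup done by the router.
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


# 1,000 distinct users insert photo metadata; tally which shard each hits.
load = Counter(shard_for(f"user{i}") for i in range(1000))
print(sorted(load.items()))  # roughly 250 inserts per shard
```

Each user still gets the write locality described above, while the cluster as a whole sees an even insert distribution.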