LACK OF LOCALITY
An ascending shard key has a clear direction; a completely random shard key has no
direction at all. The former fails to distribute inserts; the latter might distribute them
too well. This may have a counterintuitive ring to it, as the whole point of sharding is
to distribute reads and writes. But we can illustrate with a simple thought experiment.
Imagine that each of the documents in your sharded collection contains an MD5
and that the MD5 field is the shard key. Because the value of an MD5 will vary
randomly across documents, this shard key will ensure that inserts distribute evenly across
all shards in the cluster. This is desirable. But take a second to imagine how inserts
into each shard's index on the MD5 field will work. Because the MD5 is totally random,
each virtual memory page in the index is equally likely to be accessed on every insert.
Practically speaking, this means that the index must always fit in RAM and that if the
indexes and data ever grow beyond the limits of physical RAM, page faults, and thus
decreased performance, will be inevitable.
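To make this concrete, here is a minimal sketch of how such a random shard key might be declared in the mongo shell. The files.docs namespace and the md5 field name are assumptions for illustration, not listings from this topic:

sh.enableSharding("files")                     // allow the database to be sharded
sh.shardCollection("files.docs", { md5: 1 })   // range-shard on the random md5 field
// Every insert now lands wherever its md5 value happens to fall, touching an
// effectively random page of each shard's md5 index on every write.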
This is basically a problem of locality of reference. The idea of locality, at least here, is
that data accessed within any given time interval is frequently related; this closeness
can be exploited for optimization. For example, although object IDs make for poor
shard keys, they do provide excellent locality because they're ascending. This means
that successive inserts into the index will always occur within recently used virtual
memory pages; so only a small portion of the index need be in RAM at any one time.
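As a quick sketch of why object IDs behave this way (not a listing from this topic): the first four bytes of an ObjectId encode its creation time, so ids generated later always sort after ids generated earlier.

var first = new ObjectId();
var second = new ObjectId();
first.getTimestamp()                            // creation time embedded in the id
first.getTimestamp() <= second.getTimestamp()   // true: ids ascend over time, so new
                                                // index entries land at the right-hand
                                                // edge of the index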
To take a less abstract example, imagine your application allows users to upload
photos and that each photo's metadata is stored in a single document within a
sharded collection. Now suppose that a user performs a batch upload of 100 photos.
If the shard key is totally random, the database won't be able to take advantage of the
locality here; inserts into the index will occur at 100 random locations. But let's say
the shard key is the user's ID. In this case, each write to the index will occur at
roughly the same location because every inserted document will have the same user
ID value. Here you take advantage of locality, and you realize potentially significant
gains in performance.
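As a hedged sketch of that batch upload in the mongo shell, with the photos collection and the user_id, filename, and uploaded_at field names assumed purely for illustration:

var userId = new ObjectId();                  // the uploading user's id
for (var i = 0; i < 100; i++) {
  db.photos.insert({
    user_id: userId,                          // same shard-key value for the whole batch
    filename: "photo_" + i + ".jpg",
    uploaded_at: new Date()
  });
}
// All 100 entries in the user_id index cluster together, so the batch touches
// only a handful of recently used index pages rather than 100 random ones.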
Random shard keys present one more problem: any meaningful range query on
such a key will have to be sent to all shards. Think again about the sharded photos
collection just described. If you want your app to be able to display a user's 10 most
recently created photos (normally a trivial query), a random shard key will still require
that this query be distributed to all shards. As you'll see in the next section, a coarser-
grained shard key will permit a range query like this to take place on a single shard.
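For instance, a query along the following lines (field names assumed as in the sketch above) can be routed to the single shard owning that user's chunk when the user's ID is the shard key, whereas a random MD5 key would force it to be scattered to every shard:

db.photos.find({ user_id: userId })           // shard key in the query predicate
         .sort({ uploaded_at: -1 })           // most recent first
         .limit(10)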
UNSPLITTABLE CHUNKS
If random and ascending shard keys don't work well, then the next obvious option is a
coarse-grained shard key. A user ID is a good example of this. If you shard your photos
collection on a user ID, then you can expect to see inserts distributed across shards, if
only because it's impossible to tell which users will insert data when. Thus the coarse-
grained shard key takes the element of randomness and uses that to the shard cluster's
advantage.
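Declaring such a coarse-grained key might look like the following sketch; the app.photos namespace is, again, an assumption used for illustration:

sh.enableSharding("app")
sh.shardCollection("app.photos", { user_id: 1 })   // coarse-grained shard key
// Inserts spread across shards because different users write at different times,
// yet every document belonging to one user carries the same shard-key value.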