Database Reference
In-Depth Information
The second benefit of a coarse-grained shard key is that it lets the system ride out
the efficiency gains provided by locality of reference. When a user inserts 100 photo
metadata documents, a shard key on the user ID field ensures that these inserts go to
the same shard and are written to roughly the same parts of the index. This is efficient.
Distribution and locality are great benefits, but with a coarse-grained shard key
comes one intractable problem: the potential for uninhibited chunk growth. How is
that possible? Think for a moment about the sample shard key on user ID . What's the
smallest possible chunk range permitted by this shard key? The smallest range will
span a single user ID ; no smaller range is possible. This is problematic because every
data set has outliers. Suppose that you have a few outlier users whose number of pho-
tos stored exceeds the average user's number by millions. Will the system ever be able
to split any one user's photos into more than one chunk? The answer is, no. This is an
unsplittable chunk, and it's a danger to a shard cluster because it can create an imbal-
ance of data across shards.
Clearly, the ideal would be a shard key that combines the benefits of a coarse-
grained key with those of a fine-grained one. You'll see what that looks like in the next
section.
9.4.2
Ideal shard keys
From the preceding, it should be clear that you want to choose a shard key that will
Distribute inserts evenly across shards
1
Ensure that CRUD operations can take advantage of locality
2
Be granular enough to allow chunks to split
3
Shard keys that fit these requirements are generally composed of two fields, the first
being coarse-grained and the second more fine-grained. A good example of this is the
shard key in the spreadsheets example. There, you declared a compound shard key
on {username: 1, _id: 1} . As various users insert into the cluster, you can expect that
most, if not all, of any one user's spreadsheets will live on a single shard. Even when a
user's documents reside on more than one shard, the presence of the unique _id field
in the shard key guarantees that queries and updates to any one document will always
be directed to a single shard. And if you need to perform a more sophisticated query
on a user's data, you can be sure that the query will be routed only to those shards
containing that user's data.
Most importantly, the shard key on {username: 1, _id: 1} guarantees that chunks
will always be splittable, even if a given user creates a huge number of documents.
Let's take another example. Suppose you're building a website analytics system. As
you'll see in appendix B, a good data model for such a system would store one docu-
ment per page per month. Then, within that document, you'd store the data for each
day of the month, incrementing various counter fields for each page hit, and so on.
Here are the fields of a sample analytics document relevant to your shard key choice:
Search WWH ::




Custom Search