Sharding - MongoDB in Action

Database Reference

In-Depth Information

The second benefit of a coarse-grained shard key is that it lets the system ride out

the efficiency gains provided by locality of reference. When a user inserts 100 photo

metadata documents, a shard key on the user ID field ensures that these inserts go to

the same shard and are written to roughly the same parts of the index. This is efficient.

Distribution and locality are great benefits, but with a coarse-grained shard key

comes one intractable problem: the potential for uninhibited chunk growth. How is

that possible? Think for a moment about the sample shard key on user ID . What's the

smallest possible chunk range permitted by this shard key? The smallest range will

span a single user ID ; no smaller range is possible. This is problematic because every

data set has outliers. Suppose that you have a few outlier users whose number of pho-

tos stored exceeds the average user's number by millions. Will the system ever be able

to split any one user's photos into more than one chunk? The answer is, no. This is an

unsplittable chunk, and it's a danger to a shard cluster because it can create an imbal-

ance of data across shards.

Clearly, the ideal would be a shard key that combines the benefits of a coarse-

grained key with those of a fine-grained one. You'll see what that looks like in the next

section.

9.4.2

Ideal shard keys

From the preceding, it should be clear that you want to choose a shard key that will

Distribute inserts evenly across shards

1

Ensure that CRUD operations can take advantage of locality

2

Be granular enough to allow chunks to split

3

Shard keys that fit these requirements are generally composed of two fields, the first

being coarse-grained and the second more fine-grained. A good example of this is the

shard key in the spreadsheets example. There, you declared a compound shard key

on {username: 1, _id: 1} . As various users insert into the cluster, you can expect that

most, if not all, of any one user's spreadsheets will live on a single shard. Even when a

user's documents reside on more than one shard, the presence of the unique _id field

in the shard key guarantees that queries and updates to any one document will always

be directed to a single shard. And if you need to perform a more sophisticated query

on a user's data, you can be sure that the query will be routed only to those shards

containing that user's data.

Most importantly, the shard key on {username: 1, _id: 1} guarantees that chunks

will always be splittable, even if a given user creates a huge number of documents.

Let's take another example. Suppose you're building a website analytics system. As

you'll see in appendix B, a good data model for such a system would store one docu-

ment per page per month. Then, within that document, you'd store the data for each

day of the month, incrementing various counter fields for each page hit, and so on.

Here are the fields of a sample analytics document relevant to your shard key choice:

MongoDB in Action

Search WWH ::

Custom Search

Home