We'll then discuss some sharding mechanics, describing how queries and indexing work
across shards. We'll look at the ever-important choice of shard key. And I'll end the
chapter with a lot of specific advice on running sharding in production.
Sharding is complicated. To get the most out of this chapter, you should run the
examples. You'll have no trouble running the sample cluster entirely on one machine;
once you do, start experimenting with it. There's nothing like having a live sharded
deployment on hand for understanding MongoDB as a distributed system.
9.1 Sharding overview
Before you build your first sharded cluster, it's useful to understand what sharding is
and why it's sometimes appropriate. The explanation of why sharding matters gets at
one of the core justifications for the MongoDB project as a whole. Once you understand
why sharding is important, you'll appreciate learning about the core components that
make up a sharded cluster and the key concepts that underlie MongoDB's sharding
machinery.
9.1.1 What sharding is
Up until this point, you've used MongoDB as a single server, where each mongod
instance contains a complete copy of your application's data. Even when using
replication, each replica clones every other replica's data entirely. For the majority
of applications, storing the complete data set on a single server is perfectly acceptable.
But as the size of the data grows, and as an application demands greater read and write
throughput, commodity servers may not be sufficient. In particular, these servers may
not be able to address enough RAM, or they might not have enough CPU cores, to
process the workload efficiently. In addition, as the size of the data grows, it may
become impractical to store and manage backups for such a large data set all on one
disk or RAID array. If you're to continue to use commodity or virtualized hardware to
host the database, then the solution to these problems is to distribute the database
across more than one server. The method for doing this is sharding.
Numerous large web applications, notably Flickr and LiveJournal, have implemented
manual sharding schemes to distribute load across MySQL databases. In these
implementations, the sharding logic lives entirely inside the application. To understand
how this works, imagine that you had so many users that you needed to distribute
your Users table across multiple database servers. You could do this by designating
one database as the lookup database. This database would contain the metadata mapping
each user ID (or some range of user IDs) to a given shard. Thus a query for a user
would actually involve two queries: the first query would contact the lookup database
to get the user's shard location, and then a second query would be directed to the
individual shard containing the user data.
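To make that two-query flow concrete, here's a minimal sketch in Python (not from the original text). In-memory dictionaries stand in for the lookup database and the per-shard Users tables; in a real deployment each would be a separate MySQL server, and all of the names here (LOOKUP_DB, SHARDS, shard_for_user, get_user) are purely illustrative.

    # Application-level (manual) sharding sketch: a lookup "database" maps
    # ranges of user IDs to shards, and each shard holds only its own slice
    # of the Users table.

    # Query 1 target: the lookup database (range -> shard name).
    LOOKUP_DB = [
        {"min_id": 0,     "max_id": 9999,  "shard": "shard-a"},
        {"min_id": 10000, "max_id": 19999, "shard": "shard-b"},
    ]

    # Query 2 targets: the shards themselves, keyed by user ID.
    SHARDS = {
        "shard-a": {42:    {"user_id": 42,    "name": "carol"}},
        "shard-b": {15001: {"user_id": 15001, "name": "dave"}},
    }

    def shard_for_user(user_id):
        """Query 1: ask the lookup database which shard holds this user."""
        for entry in LOOKUP_DB:
            if entry["min_id"] <= user_id <= entry["max_id"]:
                return entry["shard"]
        raise KeyError("no shard assigned for user id %d" % user_id)

    def get_user(user_id):
        """Query 2: fetch the user from the shard the lookup database named."""
        shard_name = shard_for_user(user_id)
        return SHARDS[shard_name].get(user_id)

    if __name__ == "__main__":
        print(get_user(42))      # routed to shard-a
        print(get_user(15001))   # routed to shard-b

Note that every routing decision, and every change to the range-to-shard mapping, is the application's responsibility in this scheme, which is exactly the burden discussed next.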
For these web applications, manual sharding solves the load problem, but the
implementation isn't without its faults. The most notable of these is the difficulty
involved in migrating data. If a single shard is overloaded, migrating that data to
other shards is an entirely manual process.