We'll then discuss some sharding mechanics, describing how queries and indexing work
across shards. We'll look at the ever-important choice of shard key. And I'll end the
chapter with a lot of specific advice on running sharding in production.
Sharding is complicated. To get the most out of this chapter, you should run the
examples. You'll have no trouble running the sample cluster entirely on one machine;
once you do, start experimenting with it. There's nothing like having a live sharded
deployment on hand for understanding MongoDB as a distributed system.
9.1 Sharding overview
Before you build your first sharded cluster, it's useful to understand what sharding is
and why it's sometimes appropriate. The explanation of why sharding matters gets at
one of the core justifications for the MongoDB project as a whole. Once you understand
why sharding is important, you'll appreciate learning about the core components that
make up a sharded cluster and the key concepts that underlie MongoDB's sharding
machinery.
9.1.1 What sharding is
Up until this point, you've used MongoDB as a single server, where each mongod
instance contains a complete copy of your application's data. Even when using
replication, each replica clones every other replica's data entirely. For the majority
of applications, storing the complete data set on a single server is perfectly acceptable.
But as the size of the data grows, and as an application demands greater read and write
throughput, commodity servers may not be sufficient. In particular, these servers may
not be able to address enough RAM, or they might not have enough CPU cores, to
process the workload efficiently. In addition, as the size of the data grows, it may
become impractical to store and manage backups for such a large data set all on one
disk or RAID array. If you're to continue to use commodity or virtualized hardware to
host the database, then the solution to these problems is to distribute the database
across more than one server. The method for doing this is sharding.
Numerous large web applications, notably Flickr and LiveJournal, have implemented
manual sharding schemes to distribute load across MySQL databases. In these
implementations, the sharding logic lives entirely inside the application. To understand
how this works, imagine that you had so many users that you needed to distribute
your Users table across multiple database servers. You could do this by designating
one database as the lookup database. This database would contain the metadata mapping
each user ID (or some range of user IDs) to a given shard. Thus a query for a user
would actually involve two queries: the first query would contact the lookup database
to get the user's shard location, and then a second query would be directed to the
individual shard containing the user data.
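To make that two-query flow concrete, here's a minimal sketch in Python (not from the original text). In-memory dictionaries stand in for the lookup database and the per-shard Users tables; in a real deployment each would be a separate MySQL server, and all of the names here (LOOKUP_DB, SHARDS, shard_for_user, get_user) are purely illustrative.

    # Application-level (manual) sharding sketch: a lookup "database" maps
    # ranges of user IDs to shards, and each shard holds only its own slice
    # of the Users table.

    # Query 1 target: the lookup database (range -> shard name).
    LOOKUP_DB = [
        {"min_id": 0,     "max_id": 9999,  "shard": "shard-a"},
        {"min_id": 10000, "max_id": 19999, "shard": "shard-b"},
    ]

    # Query 2 targets: the shards themselves, keyed by user ID.
    SHARDS = {
        "shard-a": {42:    {"user_id": 42,    "name": "carol"}},
        "shard-b": {15001: {"user_id": 15001, "name": "dave"}},
    }

    def shard_for_user(user_id):
        """Query 1: ask the lookup database which shard holds this user."""
        for entry in LOOKUP_DB:
            if entry["min_id"] <= user_id <= entry["max_id"]:
                return entry["shard"]
        raise KeyError("no shard assigned for user id %d" % user_id)

    def get_user(user_id):
        """Query 2: fetch the user from the shard the lookup database named."""
        shard_name = shard_for_user(user_id)
        return SHARDS[shard_name].get(user_id)

    if __name__ == "__main__":
        print(get_user(42))      # routed to shard-a
        print(get_user(15001))   # routed to shard-b

Note that every routing decision, and every change to the range-to-shard mapping, is the application's responsibility in this scheme, which is exactly the burden discussed next.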
For these web applications, manual sharding solves the load problem, but the
implementation isn't without its faults. The most notable of these is the difficulty
involved in migrating data. If a single shard is overloaded, migrating that data to
other shards is an entirely manual process.