other shards is an entirely manual process. A second problem with manual sharding is
the difficulty of writing reliable application code to route reads and writes and manage
the database as a whole. Recently, frameworks for managing manual sharding have
been released, most notably Twitter's Gizzard (see http://mng.bz/4qvd ).
But as anyone who's manually sharded a database will tell you, getting it right isn't
easy. MongoDB was built in large part to address this problem. Because sharding is at
the heart of MongoDB, users need not worry about having to design an external
sharding framework when the time comes to scale horizontally. This is especially
important for handling the hard problem of balancing data across shards. The code
that makes balancing work isn't the sort of thing that most people can cook up over a
weekend.
Perhaps most significantly, MongoDB has been designed to present the same interface
to the application before and after sharding. This means that application code
needs little if any modification when the database must be converted to a sharded
architecture.
You should now have some sense of the rationale behind automated sharding. Still,
before describing the MongoDB sharding process in more detail, we should pause for
a moment to answer another obvious first question: when is sharding necessary?
WHEN TO SHARD
The question of when to shard is more straightforward than you might expect. We've
talked about the importance of keeping indexes and the working data set in RAM, and
this is the primary reason to shard. If an application's data set continues to grow
unbounded, then there will come a moment when that data no longer fits in RAM. If
you're running on Amazon's EC2, then you'll hit that threshold at 68 GB because
that's the amount of RAM available on the largest instance at the time of this writing.
Alternatively, you may run your own hardware with much more than 68 GB of RAM, in
which case you'll probably be able to delay sharding for some time. But no machine
has infinite capacity for RAM; therefore, sharding eventually becomes necessary.
To be sure, there are some fudge factors here. For instance, if you have your own
hardware and can store all your data on solid state drives (an increasingly affordable
prospect), then you'll likely be able to push the data-to-RAM ratio without negatively
affecting performance. It might also be the case that your working set is a fraction of
your total data size and that, therefore, you can operate with relatively little RAM. On
the flip side, if you have an especially demanding write load, then you may want to
shard well before data reaches the size of RAM, simply because you need to distribute
the load across machines to get the desired write throughput.
Whatever the case, the decision to shard an existing system will always be based on
regular analyses of disk activity, system load, and the ever-important ratio of working
set size to available RAM.
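To make this rule of thumb concrete, here's a minimal sketch of the sizing check just described. The function, its name, and the 80% headroom fraction are illustrative assumptions, not anything MongoDB prescribes; in practice the input sizes would come from the output of commands like db.stats() and db.serverStatus() in the mongo shell.

```python
def should_consider_sharding(working_set_gb, index_gb, ram_gb, headroom=0.8):
    """Flag sharding when indexes plus the working set approach available RAM.

    headroom is the fraction of RAM we want the hot data to fit within,
    leaving the remainder for the OS and connection overhead. The 0.8
    default is an illustrative assumption, not a documented threshold.
    """
    hot_data_gb = working_set_gb + index_gb
    return hot_data_gb > headroom * ram_gb

# Example: a 50 GB working set plus 12 GB of indexes on a 68 GB instance.
# Hot data (62 GB) exceeds 0.8 * 68 = 54.4 GB, so sharding merits a look.
print(should_consider_sharding(50, 12, 68))   # True
print(should_consider_sharding(20, 5, 68))    # 25 GB fits comfortably: False
```

Note that this captures only the memory dimension of the decision; as the paragraph above points out, a heavy write load can justify sharding well before this check fires.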