other shards is an entirely manual process. A second problem with manual sharding is
the difficulty of writing reliable application code to route reads and writes and manage
the database as a whole. Recently, frameworks for managing manual sharding have
been released, most notably Twitter's Gizzard (see http://mng.bz/4qvd ).
But as anyone who's manually sharded a database will tell you, getting it right isn't
easy. MongoDB was built in large part to address this problem. Because sharding is at
the heart of MongoDB, users need not worry about having to design an external
sharding framework when the time comes to scale horizontally. This is especially
important for handling the hard problem of balancing data across shards. The code
that makes balancing work isn't the sort of thing that most people can cook up over a
weekend.
Perhaps most significantly, MongoDB has been designed to present the same interface
to the application before and after sharding. This means that application code
needs little if any modification when the database must be converted to a sharded
architecture.
You should now have some sense of the rationale behind automated sharding. Still,
before describing the MongoDB sharding process in more detail, we should pause for
a moment to answer another obvious first question: when is sharding necessary?
WHEN TO SHARD
The question of when to shard is more straightforward than you might expect. We've
talked about the importance of keeping indexes and the working data set in RAM, and
this is the primary reason to shard. If an application's data set continues to grow
unbounded, then there will come a moment when that data no longer fits in RAM. If
you're running on Amazon's EC2, then you'll hit that threshold at 68 GB because
that's the amount of RAM available on the largest instance at the time of this writing.
Alternatively, you may run your own hardware with much more than 68 GB of RAM, in
which case you'll probably be able to delay sharding for some time. But no machine
has infinite capacity for RAM; therefore, sharding eventually becomes necessary.
To be sure, there are some fudge factors here. For instance, if you have your own
hardware and can store all your data on solid state drives (an increasingly affordable
prospect), then you'll likely be able to push the data-to-RAM ratio without negatively
affecting performance. It might also be the case that your working set is a fraction of
your total data size and that, therefore, you can operate with relatively little RAM. On
the flip side, if you have an especially demanding write load, then you may want to
shard well before data reaches the size of RAM, simply because you need to distribute
the load across machines to get the desired write throughput.
Whatever the case, the decision to shard an existing system will always be based on
regular analyses of disk activity, system load, and the ever-important ratio of working
set size to available RAM.
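To make this rule of thumb concrete, here's a minimal sketch of the sizing check just described. The function, its name, and the 80% headroom fraction are illustrative assumptions, not anything MongoDB prescribes; in practice the input sizes would come from the output of commands like db.stats() and db.serverStatus() in the mongo shell.

```python
def should_consider_sharding(working_set_gb, index_gb, ram_gb, headroom=0.8):
    """Flag sharding when indexes plus the working set approach available RAM.

    headroom is the fraction of RAM we want the hot data to fit within,
    leaving the remainder for the OS and connection overhead. The 0.8
    default is an illustrative assumption, not a documented threshold.
    """
    hot_data_gb = working_set_gb + index_gb
    return hot_data_gb > headroom * ram_gb

# Example: a 50 GB working set plus 12 GB of indexes on a 68 GB instance.
# Hot data (62 GB) exceeds 0.8 * 68 = 54.4 GB, so sharding merits a look.
print(should_consider_sharding(50, 12, 68))   # True
print(should_consider_sharding(20, 5, 68))    # 25 GB fits comfortably: False
```

Note that this captures only the memory dimension of the decision; as the paragraph above points out, a heavy write load can justify sharding well before this check fires.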