Chapter 12
Sharding
Whether you're building the next Facebook or just a simple database application, you will probably need to scale
your app up at some point if it's successful. If you don't want to be continually replacing your hardware (or you begin
approaching the limits of what you can do on just one piece of hardware), then you will want to use a technique that
allows you to add capacity incrementally as you need it. Sharding is such a technique: it lets you
spread your data across multiple machines while still presenting it to your application as a single database.
Ideally suited for cloud-based computing platforms, sharding as implemented by MongoDB is perfect for
dynamic, load-sensitive automatic scaling, where you ramp up your capacity as you need it and turn it down when
you don't.
This chapter will walk you through implementing sharding in MongoDB and will look at some of the advanced
functionality provided within MongoDB's sharding implementation, such as tag sharding and hashed shard keys.
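As a brief preview, the core setup boils down to a handful of mongo shell commands issued against a mongos router. The sketch below is illustrative only; the database name blog, the collection posts, and the shard key userId are assumptions chosen for this example rather than names used elsewhere in the chapter:

// Allow the "blog" database to be sharded (names here are hypothetical).
sh.enableSharding("blog")

// Shard the "posts" collection on a range-based shard key.
sh.shardCollection("blog.posts", { userId: 1 })

// Alternatively, a hashed shard key spreads writes more evenly across shards.
// A collection can be sharded only once, so you would pick one or the other:
// sh.shardCollection("blog.posts", { userId: "hashed" })

The rest of the chapter walks through these steps, and the reasoning behind them, in detail.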
Exploring the Need for Sharding
When the World Wide Web was just getting under way, the number of sites and users, and the amount of information
available online, were all small. The Web consisted of a few thousand sites and a population of only tens or perhaps
hundreds of thousands of users, predominantly centered on the academic and research communities. In those early
days, data tended to be simple: hand-maintained HTML documents connected together by hyperlinks. The original
design objective of the protocols that make up the Web was to provide a means of creating navigable references to
documents stored on different servers around the Internet.
Even a current big brand name such as Yahoo! had only a minuscule presence on the Web compared to its
offerings today. The original product around which the company was formed was the Yahoo directory, little more
than a network of hand-edited links to popular sites. These links were maintained by a small but enthusiastic band of
people called the surfers. Each page in the Yahoo directory was a simple HTML document stored in a tree of filesystem
directories and maintained using a simple text editor.
But as the size of the net started to explode—and the number of sites and visitors started its near-vertical
climb upwards—the sheer volume of resources available forced the early Web pioneers to move away from simple
documents to more complex dynamic page generation from separate data stores.
Search engines started to spider the Web and pull together databases of links that today number in the hundreds
of billions, along with tens of billions of stored pages.
These developments prompted a move toward datasets managed and maintained by evolving content
management systems, with the underlying data stored mainly in databases for easier access.
At the same time, new kinds of services evolved that stored more than just documents and link sets. For example,
audio, video, events, and all kinds of other data started to make their way into these huge datastores. This process is
often described as the “industrialization of data,” and in many ways it parallels the industrial revolution that
centered on manufacturing during the 19th century.
Eventually, every successful company on the Web faces the problem of how to access the data stored in these
mammoth databases. They find that a single database server can handle only so many queries per second, and that
network interfaces and disk drives can transfer only so many megabytes per second to and from a single machine.