Chapter 12
Sharding
Whether you're building the next Facebook or just a simple database application, you will probably need to scale
your app up at some point if it's successful. If you don't want to be continually replacing your hardware (or you begin
approaching the limits of what you can do on just one piece of hardware), then you will want to use a technique that
allows you to add capacity incrementally as you need it. Sharding is such a technique: it lets you
spread your data across multiple machines while still presenting it to your application as a single database.
Ideally suited for cloud-based computing platforms, sharding as implemented by MongoDB is perfect for
dynamic, load-sensitive automatic scaling, where you ramp up your capacity as you need it and turn it down when
you don't.
This chapter will walk you through implementing sharding in MongoDB and will look at some of the advanced
functionality provided within MongoDB's sharding implementation, such as tag sharding and hashed shard keys.
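As a brief preview, the core setup boils down to a handful of mongo shell commands issued against a mongos router. The sketch below is illustrative only; the database name blog, the collection posts, and the shard key userId are assumptions chosen for this example rather than names used elsewhere in the chapter:

// Allow the "blog" database to be sharded (names here are hypothetical).
sh.enableSharding("blog")

// Shard the "posts" collection on a range-based shard key.
sh.shardCollection("blog.posts", { userId: 1 })

// Alternatively, a hashed shard key spreads writes more evenly across shards.
// A collection can be sharded only once, so you would pick one or the other:
// sh.shardCollection("blog.posts", { userId: "hashed" })

The rest of the chapter walks through these steps, and the reasoning behind them, in detail.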
Exploring the Need for Sharding
When the World Wide Web was just getting under way, the number of sites and users, and the amount of information
available online, were all small. The Web consisted of a few thousand sites and a population of only tens or perhaps
hundreds of thousands of users, predominantly centered on the academic and research communities. In those early
days, data tended to be simple: hand-maintained HTML documents connected together by hyperlinks. The original
design objective of the protocols that make up the Web was to provide a means of creating navigable references to
documents stored on different servers around the Internet.
Even a current big brand name such as Yahoo! had only a minuscule presence on the Web compared to its
offerings today. The original product around which the company was formed was the Yahoo directory, little more
than a network of hand-edited links to popular sites. These links were maintained by a small but enthusiastic band of
people called the surfers. Each page in the Yahoo directory was a simple HTML document stored in a tree of filesystem
directories and maintained using a simple text editor.
But as the size of the net started to explode—and the number of sites and visitors started its near-vertical
climb upwards—the sheer volume of resources available forced the early Web pioneers to move away from simple
documents to more complex dynamic page generation from separate data stores.
Search engines started to spider the Web and pull together databases of links that today number in the hundreds
of billions, along with tens of billions of stored pages.
These developments prompted a move toward datasets managed and maintained by evolving content
management systems, with the underlying data stored mainly in databases for easier access.
At the same time, new kinds of services evolved that stored more than just documents and link sets. For example,
audio, video, events, and all kinds of other data started to make their way into these huge datastores. This process is
often described as the “industrialization of data,” and in many ways it parallels the industrial revolution that
centered on manufacturing during the 19th century.
Eventually, every successful company on the Web faces the problem of how to access the data stored in these
mammoth databases. They find that a single database server can handle only so many queries per second, and that
network interfaces and disk drives can transfer only so many megabytes per second to and from a single machine.