the web servers. Companies that provide web-based services can quickly find themselves exceeding the capacity
of a single server, network, or drive array. In such cases, they are compelled to divide and distribute their massive
collections of data. The usual solution is to partition these mammoth chunks of data into smaller pieces that can
be managed more reliably and quickly. At the same time, these companies need to maintain the ability to perform
operations across the entire breadth of the data held in their large clusters of machines.
Replication, which you learned about in some detail in Chapter 11, can be an effective tool for overcoming some
of these scaling issues, enabling you to create multiple identical copies of your data on multiple servers. This allows
you (in the right circumstances) to spread your server load across more machines.
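To make that concrete, here is a minimal PyMongo sketch of read scaling with replication; the host names and the replica-set name (rs0) are placeholders, not details from the text:

from pymongo import MongoClient, ReadPreference

# Connect to a (hypothetical) replica set holding identical copies of the data.
client = MongoClient(
    "mongodb://db1.example.com:27017,db2.example.com:27017/",
    replicaSet="rs0",
)

# Route reads to secondary members when one is available, falling back
# to the primary; writes always go to the primary.
db = client.get_database(
    "appdb", read_preference=ReadPreference.SECONDARY_PREFERRED
)
print(db.users.count_documents({}))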
Before long, however, you run headlong into another problem, where the individual tables or collections that
make up your dataset grow so large that their size exceeds the capacity of a single database system to manage them
effectively. For example, Facebook has let it be known that it receives over 300 million photos per day! And the site has
been operating for almost 10 years.
Over a year, that's 109.5 billion photos, and storing that much data in a single table is not feasible. So Facebook, like
many companies before it, looked at ways of distributing that set of records across a large number of database
servers. The solution adopted by Facebook serves as one of the better-documented (and publicized) implementations
of sharding in the real world.
Partitioning Data Horizontally and Vertically
Data partitioning is the mechanism of splitting data across multiple independent datastores. Those datastores can
be coresident (on the same system) or remote (on separate systems). The motivation for coresident partitioning is to
reduce the size of individual indexes and reduce the amount of I/O that is needed to update records. The motivation
for remote partitioning is to increase the bandwidth of access to data, by having more RAM in which to store data, by
avoiding disk access, or by having more network interfaces and disk I/O channels available.
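As a rough illustration of remote partitioning, the following sketch routes each record to one of several independent MongoDB servers by hashing a key; the server addresses and the database and collection names are hypothetical:

import zlib

from pymongo import MongoClient

# Two (hypothetical) independent datastores, each holding part of the data.
PARTITIONS = [
    MongoClient("mongodb://store0.example.com:27017/"),
    MongoClient("mongodb://store1.example.com:27017/"),
]

def partition_for(user_id):
    # CRC32 is deterministic across processes, unlike Python's built-in
    # hash(), so the same key always routes to the same datastore.
    index = zlib.crc32(user_id.encode("utf-8")) % len(PARTITIONS)
    return PARTITIONS[index].appdb.users

partition_for("alice").insert_one({"_id": "alice", "city": "Oslo"})
doc = partition_for("alice").find_one({"_id": "alice"})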
Partitioning Data Vertically
In the traditional view of databases, data is stored in rows and columns. Vertical partitioning consists of breaking
up a record on column boundaries and storing the parts in separate tables or collections. It can be argued that a
relational database design that uses joined tables with a one-to-one relationship is a form of coresident vertical data
partitioning.
MongoDB, however, does not lend itself to this form of partitioning, because the structure of its records
(documents) does not fit the nice and tidy row-and-column model. Therefore, there are few opportunities to cleanly
separate a row based on its column boundaries. MongoDB also promotes the use of embedded documents, and it
does not directly support the ability to join associated collections together on the server (such joins can be performed
in your application).
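That said, you can approximate coresident vertical partitioning by hand, splitting frequently read fields and bulky, rarely used fields into two collections that share an _id. The sketch below uses made-up collection and field names:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/").appdb

# Hot fields in one collection, cold fields in another, linked by _id.
db.users_core.insert_one({"_id": 42, "name": "Ada", "email": "ada@example.com"})
db.users_extra.insert_one({"_id": 42, "avatar": b"...", "bio": "A long biography..."})

# Reassembling the record takes two queries; as noted above, the join
# happens in the application, not on the server.
core = db.users_core.find_one({"_id": 42})
extra = db.users_extra.find_one({"_id": 42})
full_record = {**(core or {}), **(extra or {})}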
Partitioning Data Horizontally
Horizontal partitioning is the only alternative when using MongoDB, and sharding is the common term for a
popular form of horizontal partitioning. Sharding allows you to split a collection across multiple servers to improve
performance when a single collection contains a large number of documents.
A simple example of sharding occurs when a collection of user records is divided across a set of servers, so that all
the records for people with last names that begin with the letters A-G are on one server, H-M are on another, and so
on. The rule that splits the data is known as the shard key.
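In MongoDB itself, declaring the shard key is a one-time administrative step. Here is a minimal PyMongo sketch, assuming you are connected to a mongos router and that appdb.users is a placeholder namespace:

from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos.example.com:27017/")

# Enable sharding for the database, then shard the collection on the
# last_name field so ranges of surnames are assigned to different shards.
mongos.admin.command("enableSharding", "appdb")
mongos.admin.command("shardCollection", "appdb.users", key={"last_name": 1})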
In simple terms, sharding allows you to treat the cloud of shards as though it were a single collection, and an
application does not need to be aware that the data is distributed across multiple machines. In contrast, traditional
sharding implementations require the application to be actively involved in determining which server a particular
document is stored on, so that it can route its requests properly. Traditionally, a library bound to the application is
responsible for storing and querying data in sharded datasets.
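A sketch of that traditional, application-managed style follows; the routing table lives in the application, and the server addresses and name ranges are invented for illustration:

from pymongo import MongoClient

# Each entry maps a range of last-name initials to the server that owns it.
RANGES = [
    ("A", "G", MongoClient("mongodb://db-ag.example.com:27017/")),
    ("H", "M", MongoClient("mongodb://db-hm.example.com:27017/")),
    ("N", "Z", MongoClient("mongodb://db-nz.example.com:27017/")),
]

def users_for(last_name):
    # The application, not the database, decides where a record lives.
    initial = last_name[:1].upper()
    for low, high, client in RANGES:
        if low <= initial <= high:
            return client.appdb.users
    raise ValueError("no server owns the initial %r" % initial)

users_for("Garcia").insert_one({"last_name": "Garcia", "first_name": "Ana"})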
 