the web servers. Companies that provide web-based services can quickly find themselves exceeding the capacity
of a single server, network, or drive array. In such cases, they are compelled to divide and distribute their massive
collections of data. The usual solution is to partition these mammoth chunks of data into smaller pieces that can
be managed more reliably and quickly. At the same time, these companies need to maintain the ability to perform
operations across the entire breadth of the data held in their large clusters of machines.
Replication, which you learned about in some detail in Chapter 11, can be an effective tool for overcoming some
of these scaling issues, enabling you to create multiple identical copies of your data on multiple servers. This allows
you (in the right circumstances) to spread your server load across more machines.
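To make that concrete, here is a minimal PyMongo sketch of read scaling with replication; the host names and the replica-set name (rs0) are placeholders, not details from the text:

from pymongo import MongoClient, ReadPreference

# Connect to a (hypothetical) replica set holding identical copies of the data.
client = MongoClient(
    "mongodb://db1.example.com:27017,db2.example.com:27017/",
    replicaSet="rs0",
)

# Route reads to secondary members when one is available, falling back
# to the primary; writes always go to the primary.
db = client.get_database(
    "appdb", read_preference=ReadPreference.SECONDARY_PREFERRED
)
print(db.users.count_documents({}))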
Before long, however, you run headlong into another problem, where the individual tables or collections that
make up your dataset grow so large that their size exceeds the capacity of a single database system to manage them
effectively. For example, Facebook has let it be known that it receives over 300 million photos per day! And the site has
been operating for almost 10 years.
Over a year, that's 109.5 billion photos, and storing that much data in a single table is not feasible. So Facebook, like
many companies before it, looked at ways of distributing that set of records across a large number of database
servers. The solution adopted by Facebook serves as one of the better-documented (and publicized) implementations
of sharding in the real world.
Partitioning Data Horizontally and Vertically
Data partitioning is the mechanism of splitting data across multiple independent datastores. Those datastores can
be coresident (on the same system) or remote (on separate systems). The motivation for coresident partitioning is to
reduce the size of individual indexes and reduce the amount of I/O that is needed to update records. The motivation
for remote partitioning is to increase the bandwidth of access to data, by having more RAM in which to store data, by
avoiding disk access, or by having more network interfaces and disk I/O channels available.
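As a rough illustration of remote partitioning, the following sketch routes each record to one of several independent MongoDB servers by hashing a key; the server addresses and the database and collection names are hypothetical:

import zlib

from pymongo import MongoClient

# Two (hypothetical) independent datastores, each holding part of the data.
PARTITIONS = [
    MongoClient("mongodb://store0.example.com:27017/"),
    MongoClient("mongodb://store1.example.com:27017/"),
]

def partition_for(user_id):
    # CRC32 is deterministic across processes, unlike Python's built-in
    # hash(), so the same key always routes to the same datastore.
    index = zlib.crc32(user_id.encode("utf-8")) % len(PARTITIONS)
    return PARTITIONS[index].appdb.users

partition_for("alice").insert_one({"_id": "alice", "city": "Oslo"})
doc = partition_for("alice").find_one({"_id": "alice"})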
Partitioning Data Vertically
In the traditional view of databases, data is stored in rows and columns. Vertical partitioning consists of breaking
up a record on column boundaries and storing the parts in separate tables or collections. It can be argued that a
relational database design that uses joined tables with a one-to-one relationship is a form of coresident vertical data
partitioning.
MongoDB, however, does not lend itself to this form of partitioning, because the structure of its records
(documents) does not fit the nice and tidy row-and-column model. Therefore, there are few opportunities to cleanly
separate a row based on its column boundaries. MongoDB also promotes the use of embedded documents, and it
does not directly support the ability to join associated collections together on the server (such joins can be performed
in your application).
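That said, you can approximate coresident vertical partitioning by hand, splitting frequently read fields and bulky, rarely used fields into two collections that share an _id. The sketch below uses made-up collection and field names:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/").appdb

# Hot fields in one collection, cold fields in another, linked by _id.
db.users_core.insert_one({"_id": 42, "name": "Ada", "email": "ada@example.com"})
db.users_extra.insert_one({"_id": 42, "avatar": b"...", "bio": "A long biography..."})

# Reassembling the record takes two queries; as noted above, the join
# happens in the application, not on the server.
core = db.users_core.find_one({"_id": 42})
extra = db.users_extra.find_one({"_id": 42})
full_record = {**(core or {}), **(extra or {})}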
Partitioning Data Horizontally
Horizontal partitioning is the only alternative when using MongoDB, and sharding is the common term for a
popular form of horizontal partitioning. Sharding allows you to split a collection across multiple servers to improve
performance when a single collection contains a large number of documents.
A simple example of sharding occurs when a collection of user records is divided across a set of servers, so that all
the records for people with last names that begin with the letters A-G are on one server, H-M are on another, and so
on. The rule that splits the data is known as the shard key.
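In MongoDB itself, declaring the shard key is a one-time administrative step. Here is a minimal PyMongo sketch, assuming you are connected to a mongos router and that appdb.users is a placeholder namespace:

from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos.example.com:27017/")

# Enable sharding for the database, then shard the collection on the
# last_name field so ranges of surnames are assigned to different shards.
mongos.admin.command("enableSharding", "appdb")
mongos.admin.command("shardCollection", "appdb.users", key={"last_name": 1})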
In simple terms, sharding allows you to treat the cloud of shards as though it were a single collection, and an
application does not need to be aware that the data is distributed across multiple machines. In contrast, traditional
sharding implementations require the application to be actively involved in determining which server a particular
document is stored on, so that it can route its requests properly. Traditionally, a library bound to the application is
responsible for storing and querying data in sharded datasets.
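A sketch of that traditional, application-managed style follows; the routing table lives in the application, and the server addresses and name ranges are invented for illustration:

from pymongo import MongoClient

# Each entry maps a range of last-name initials to the server that owns it.
RANGES = [
    ("A", "G", MongoClient("mongodb://db-ag.example.com:27017/")),
    ("H", "M", MongoClient("mongodb://db-hm.example.com:27017/")),
    ("N", "Z", MongoClient("mongodb://db-nz.example.com:27017/")),
]

def users_for(last_name):
    # The application, not the database, decides where a record lives.
    initial = last_name[:1].upper()
    for low, high, client in RANGES:
        if low <= initial <= high:
            return client.appdb.users
    raise ValueError("no server owns the initial %r" % initial)

users_for("Garcia").insert_one({"last_name": "Garcia", "first_name": "Ana"})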
 