required for requests to any given shard. In other words, the application connects
locally to a mongos, and the mongos manages connections to the individual shards.
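For instance, connecting the shell to a mongos looks no different from connecting to a standalone mongod; the hostname below is a placeholder, and a mongos conventionally listens on the same default port (27017) as a mongod:

    mongo mongos-host.example.net:27017

The same applies to your application's driver: point its connection at the mongos instead of a mongod, and the rest of the application code is unchanged.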
Config servers
If mongos processes are nonpersistent, then something must durably store the shard
cluster's canonical state; that's the job of the config servers. The config servers persist
the shard cluster's metadata. This data includes the global cluster configuration; the
locations of each database, collection, and the particular ranges of data therein; and a
change log preserving a history of the migrations of data across shards.
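You can poke at this metadata yourself. It lives in a database named config and is readable through any mongos like ordinary data. Here's a minimal sketch (the collection names are real, though the contents will vary by cluster):

    use config
    db.chunks.count()               // how many ranges of data are parceled out across shards
    db.changelog.find().limit(1)    // one entry from the migration history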
The metadata held by the config servers is central to the proper functioning and
upkeep of the cluster. For instance, every time a mongos process is started, the mongos
fetches a copy of the metadata from the config servers. Without this data, no coherent
view of the shard cluster is possible. The importance of this data, then, informs the
design and deployment strategy for the config servers.
If you examine figure 9.1, you'll see that there are three config servers, but that
they're not deployed as a replica set. They demand something stronger than asynchronous
replication; when the mongos process writes to them, it does so using a two-phase
commit. This guarantees consistency across config servers. You must run exactly
three config servers in any production deployment of sharding, and these servers
must reside on separate machines for redundancy.1
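Concretely, such a deployment might be started as in the following sketch, with hypothetical hostnames. Each config server is a mongod run with the --configsvr option on its own machine, and every mongos in the cluster is pointed at the same three servers:

    mongod --configsvr --dbpath /data/configdb --port 27019

    mongos --configdb cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019

The --configdb list is what ties the cluster together: each mongos must be given the addresses of all three config servers.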
You now know what a shard cluster consists of, but you're probably still wondering
about the sharding machinery itself. How is data actually distributed? I'll explain that
in the next section by introducing the core sharding operations.
CORE SHARDING OPERATIONS
A MongoDB shard cluster distributes data across shards on two levels. The coarser of
these is by database. As you create new databases in the cluster, each database is
assigned to a different shard. If you do nothing else, a database and all its collections
will live forever on the shard where they were created.
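You can see these assignments in the config metadata just mentioned. Each database's home shard is recorded in the config database's databases collection; the database and shard names in this sample output are hypothetical:

    use config
    db.databases.find()
    // { "_id" : "cloud-docs", "partitioned" : false, "primary" : "shard-a" }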
Because most applications will keep all their data in the same physical database,
this distribution isn't very helpful. You need to distribute on a more granular level,
and the collection satisfies this requirement. MongoDB's sharding is designed specifically
to distribute individual collections across shards.
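Mechanically, this takes two commands run against a mongos: one to enable sharding on a database and one to shard a collection on a chosen shard key. The following is a minimal sketch; the database name, collection name, and shard key are hypothetical placeholders:

    use admin
    db.runCommand({enablesharding: "cloud-docs"})
    db.runCommand({shardcollection: "cloud-docs.spreadsheets",
      key: {username: 1}})

To understand this better, let's imagine how this might work in a real application.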
Suppose you're building a cloud-based office suite for managing spreadsheets and
that you're going to store all the data in MongoDB.2 Users will be able to create as
many documents as they want, and you'll store each one as a separate MongoDB document
in a single spreadsheets collection. Assume that over time, your application
grows to a million users. Now imagine your two primary collections: users and
spreadsheets. The users collection is manageable. Even with a million users, at 1 KB
1. You can also run just a single config server, but only as a way of more easily testing sharding. Running with just one config server in production is like taking a transatlantic flight in a single-engine jet: it might get you there, but lose an engine and you're hosed.
2. Think something like Google Docs, which, among other things, allows users to create spreadsheets and presentations.