required for requests to any given shard. In other words, the application connects
locally to a mongos, and the mongos manages connections to the individual shards.
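For instance, connecting the shell to a mongos looks no different from connecting to a standalone mongod; the hostname below is a placeholder, and a mongos conventionally listens on the same default port (27017) as a mongod:

    mongo mongos-host.example.net:27017

The same applies to your application's driver: point its connection at the mongos instead of a mongod, and the rest of the application code is unchanged.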
Config servers
If mongos processes are nonpersistent, then something must durably store the shard
cluster's canonical state; that's the job of the config servers. The config servers persist
the shard cluster's metadata. This data includes the global cluster configuration; the
locations of each database, collection, and the particular ranges of data therein; and a
change log preserving a history of the migrations of data across shards.
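You can poke at this metadata yourself. It lives in a database named config and is readable through any mongos like ordinary data. Here's a minimal sketch (the collection names are real, though the contents will vary by cluster):

    use config
    db.chunks.count()               // how many ranges of data are parceled out across shards
    db.changelog.find().limit(1)    // one entry from the migration history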
The metadata held by the config servers is central to the proper functioning and
upkeep of the cluster. For instance, every time a mongos process is started, the mongos
fetches a copy of the metadata from the config servers. Without this data, no coherent
view of the shard cluster is possible. The importance of this data, then, informs the
design and deployment strategy for the config servers.
If you examine figure 9.1, you'll see that there are three config servers, but that
they're not deployed as a replica set. They demand something stronger than asynchronous
replication; when the mongos process writes to them, it does so using a two-phase
commit. This guarantees consistency across config servers. You must run exactly
three config servers in any production deployment of sharding, and these servers
must reside on separate machines for redundancy.1
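Concretely, such a deployment might be started as in the following sketch, with hypothetical hostnames. Each config server is a mongod run with the --configsvr option on its own machine, and every mongos in the cluster is pointed at the same three servers:

    mongod --configsvr --dbpath /data/configdb --port 27019

    mongos --configdb cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019

The --configdb list is what ties the cluster together: each mongos must be given the addresses of all three config servers.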
You now know what a shard cluster consists of, but you're probably still wondering
about the sharding machinery itself. How is data actually distributed? I'll explain that
in the next section by introducing the core sharding operations.
CORE SHARDING OPERATIONS
A MongoDB shard cluster distributes data across shards on two levels. The coarser of
these is by database. As you create new databases in the cluster, each database is
assigned to a different shard. If you do nothing else, a database and all its collections
will live forever on the shard where they were created.
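You can see these assignments in the config metadata just mentioned. Each database's home shard is recorded in the config database's databases collection; the database and shard names in this sample output are hypothetical:

    use config
    db.databases.find()
    // { "_id" : "cloud-docs", "partitioned" : false, "primary" : "shard-a" }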
Because most applications will keep all their data in the same physical database,
this distribution isn't very helpful. You need to distribute on a more granular level,
and the collection satisfies this requirement. MongoDB's sharding is designed specifically
to distribute individual collections across shards.
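Mechanically, this takes two commands run against a mongos: one to enable sharding on a database and one to shard a collection on a chosen shard key. The following is a minimal sketch; the database name, collection name, and shard key are hypothetical placeholders:

    use admin
    db.runCommand({enablesharding: "cloud-docs"})
    db.runCommand({shardcollection: "cloud-docs.spreadsheets",
      key: {username: 1}})

To understand this better, let's imagine how this might work in a real application.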
Suppose you're building a cloud-based office suite for managing spreadsheets and
that you're going to store all the data in MongoDB.2 Users will be able to create as
many documents as they want, and you'll store each one as a separate MongoDB document
in a single spreadsheets collection. Assume that over time, your application
grows to a million users. Now imagine your two primary collections: users and
spreadsheets. The users collection is manageable. Even with a million users, at 1 KB
1. You can also run just a single config server, but only as a way of more easily testing sharding. Running with just one config server in production is like taking a transatlantic flight in a single-engine jet: it might get you there, but lose an engine and you're hosed.
2. Think something like Google Docs, which, among other things, allows users to create spreadsheets and presentations.