{
  _id: ObjectId("4d750a90c35169d10fc8c982"),
  domain: "org.mongodb",
  url: "/downloads",
  period: "2011-12"
}
The simplest shard key for a sharded collection containing documents like this would
consist of each page's domain followed by its url: {domain: 1, url: 1}. All pages from
a given domain would generally live on a single shard, but the outlier domains with
massive numbers of pages would still be split across shards when necessary.
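To make this concrete, here's a minimal sketch of declaring that shard key from a mongo
shell connected to a mongos. The database name events and collection name pageviews
are hypothetical stand-ins, not names taken from the sample cluster:

  use admin
  db.runCommand({enableSharding: "events"})
  db.runCommand({shardCollection: "events.pageviews", key: {domain: 1, url: 1}})

Once the collection is sharded this way, documents with the same domain sort together
in the key space, so chunks split cleanly along domain and then url boundaries.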
9.5 Sharding in production
When deploying a shard cluster to production, you're presented with a number of
choices and challenges. Here I describe a couple of recommended deployment
topologies and provide some answers to common deployment questions. We'll then
consider matters of server administration, including monitoring, backups, failover,
and recovery.
9.5.1 Deployment and configuration
Deployment and configuration are hard to get right the first time around. The following
are some guidelines for organizing the cluster and configuring it with ease.
DEPLOYMENT TOPOLOGIES
To launch the sample MongoDB shard cluster, you had to start a total of nine
processes (three mongods for each replica set, plus three config servers). That's a poten-
tially frightening number. First-time users might assume that running a two-shard
cluster in production would require nine separate machines. Fortunately, many fewer
are needed. You can see why by looking at the expected resource requirements for
each component of the cluster.
Consider first the replica sets. Each replicating member contains a complete copy
of the data for its shard and may run as a primary or secondary node. These processes
will always require enough disk space to store their copy of the data, and enough RAM
to serve that data efficiently. Thus replicating mongods are the most resource-intensive
processes in a shard cluster and must be given their own machines.
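For a rough point of reference, a data-bearing shard member is launched as an ordinary
replica set mongod with the --shardsvr option; the replica set name, paths, and port
below are hypothetical placeholders:

  mongod --shardsvr --replSet shard-a --dbpath /data/rs-a-1 --port 30000 \
      --fork --logpath /var/log/mongodb/rs-a-1.log

Each such process should get its own machine, with disk and RAM sized for the shard's
full data set.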
What about replica set arbiters? These processes store replica set config data only,
which is kept in a single document. Hence, arbiters incur little overhead and certainly
don't need their own servers.
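As a sketch, an arbiter runs the same mongod binary with a trivially small data
directory (the paths, port, and host name here are placeholders):

  mongod --replSet shard-a --dbpath /data/arbiter --port 30002 \
      --fork --logpath /var/log/mongodb/arbiter.log

Then, from a mongo shell connected to the shard-a primary:

  rs.addArb("arbiter-host:30002")

Because the arbiter holds no data, it can comfortably share a machine with, say, a
config server or an application process.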
Next are the config servers. These also store a relatively small amount of data. For
instance, the data on the config servers managing the sample shard cluster totaled only
about 30 KB. If you assume that this data will grow linearly with shard cluster data size,
then a 1 TB shard cluster might swell the config servers' data size to a mere 30 MB.14
This means that config servers don't necessarily need their own machines, either. But
14 That's a highly conservative estimate. The real value will likely be far smaller.