Storing Streaming Data - Real-Time Analytics

Sharding

In addition to replication, MongoDB also supports auto-balancing sharding

for data being stored in collections (except capped collections). The

implementation is quite similar to the approach used by twemproxy to

cluster Redis, except that data may move between shards over time.

MongoDB's sharding implementation is designed to transition smoothly

from an unsharded environment to a sharded one. The usual

recommendation, when starting to use MongoDB, is to begin with an

unsharded dataset in a normal Replica Set and then add shards as the

dataset grows to the point that the working set approaches the available

RAM on a single server.

This works because MongoDB shards are simply Replica Sets.

To implement the sharding itself, MongoDB introduces an auxiliary server

called mongos. In combination with a configuration server, which is simply

another MongoDB Replica Set, the mongos server acts as an intermediary

between applications and the database shards. It is responsible for

distributing queries and collating results for return to a client. In this sense,

mongos is very similar to the twemproxy server used to shard Redis and

Memcached instances. A sharded MongoDB cluster may have any number

of mongos servers running at any given time. A common strategy is to

run a mongos process on each application server to simplify application

configuration.
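
The routing role described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not how mongos is implemented: the ChunkMap and Router classes, the user_id key, and the shard names are all hypothetical, and the chunk map merely stands in for the metadata that the configuration servers would provide. A query on the shard key is sent to a single shard, while any other query is sent to every shard and the results are collated.

```python
import bisect


class ChunkMap:
    """Hypothetical stand-in for config-server metadata: key ranges -> shards."""

    def __init__(self, split_points, shards):
        # split_points [100, 200] defines three ranges:
        # (-inf, 100), [100, 200), [200, +inf)
        if len(shards) != len(split_points) + 1:
            raise ValueError("need exactly one shard per key range")
        self.split_points = split_points
        self.shards = shards

    def shard_for(self, key):
        # bisect_right finds the index of the range that contains the key.
        return self.shards[bisect.bisect_right(self.split_points, key)]


class Router:
    """Hypothetical router playing the intermediary role described for mongos."""

    def __init__(self, chunk_map, shard_data):
        self.chunk_map = chunk_map
        self.shard_data = shard_data  # shard name -> list of documents

    def find_by_key(self, key):
        # Targeted query: only the shard that owns the key is consulted.
        shard = self.chunk_map.shard_for(key)
        return [d for d in self.shard_data[shard] if d["user_id"] == key]

    def find(self, predicate):
        # Scatter-gather query: ask every shard, then collate the results.
        results = []
        for docs in self.shard_data.values():
            results.extend(d for d in docs if predicate(d))
        return sorted(results, key=lambda d: d["user_id"])


chunk_map = ChunkMap([100, 200], ["shardA", "shardB", "shardC"])
router = Router(chunk_map, {
    "shardA": [{"user_id": 5, "clicks": 3}, {"user_id": 42, "clicks": 10}],
    "shardB": [{"user_id": 150, "clicks": 7}],
    "shardC": [{"user_id": 250, "clicks": 12}],
})

print(router.find_by_key(150))
print(router.find(lambda d: d["clicks"] >= 10))
```

Because the router holds no data of its own and only reads the key-range metadata, any number of routers can run side by side, which is what makes one process per application server practical.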

The mongos server differs from twemproxy in that it is also responsible

for cluster rebalancing. Unlike many sharding approaches, which rely on

a fixed assignment of keys to shards, MongoDB attempts to automatically balance

data and load across the cluster. This is handled by a balancer process,

initiated by one of the mongos servers attached to the cluster. Normally this

process is automatic and started when mongos detects sufficient imbalance

in cluster resources. However, because rebalancing the cluster places a load

on the system, it is possible to configure the balancer to only run during

specific time windows when load is expected to be low (for example, early

morning hours for most Internet applications).
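
The balancing behavior described above can be sketched with a simplified model. This is an illustration under assumptions rather than MongoDB's actual algorithm: it counts chunks per shard, the threshold of 2 is an arbitrary illustrative value, and the function names in_window and balance_round are hypothetical. In a real cluster, the time window is stored as an activeWindow setting in the settings collection of the config database; here it is just a pair of times checked in plain Python.

```python
from datetime import time


def in_window(now, start, stop):
    """Return True if `now` is in [start, stop), allowing wrap past midnight."""
    if start <= stop:
        return start <= now < stop
    return now >= start or now < stop


def balance_round(chunks, threshold=2):
    """Move chunks from the fullest shard to the emptiest shard.

    `chunks` maps shard name -> list of chunk ids and is modified in place.
    Stops once the largest difference in chunk counts is below `threshold`.
    Returns the list of (chunk, source, destination) migrations performed.
    """
    if threshold < 2:
        # With a threshold of 1 a single chunk could bounce back and forth.
        raise ValueError("threshold must be at least 2")
    migrations = []
    while True:
        fullest = max(chunks, key=lambda s: len(chunks[s]))
        emptiest = min(chunks, key=lambda s: len(chunks[s]))
        if len(chunks[fullest]) - len(chunks[emptiest]) < threshold:
            return migrations
        chunk = chunks[fullest].pop()
        chunks[emptiest].append(chunk)
        migrations.append((chunk, fullest, emptiest))


cluster = {
    "shardA": ["c1", "c2", "c3", "c4", "c5"],
    "shardB": ["c6"],
    "shardC": [],
}

# Only rebalance during an assumed low-load window of 02:00 to 06:00.
if in_window(time(3, 0), time(2, 0), time(6, 0)):
    for chunk, src, dst in balance_round(cluster):
        print(f"moved {chunk} from {src} to {dst}")
```

Each migration copies data between servers, which is the load the time window is meant to keep away from peak hours.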

To start using sharding in MongoDB, a set of configuration servers

must first be established. These configuration servers play the same role as

ZooKeeper plays for Kafka, providing configuration metadata. Most

importantly, they contain information about how keys are distributed
