Now you know how to start and stop the balancer, and how to check what the balancer is doing at a given point.
You will also want to be able to set a window when the balancer will be active. As an example, let's set our balancer
to run between 8PM and 6AM, which lets it run overnight when our cluster is (hypothetically) less active. To do this,
we update the balancer settings document from before, as it controls whether the balancer is running. The exchange
looks like this:
> use config
switched to db config
> db.settings.update({ _id: "balancer" }, { $set: { activeWindow: { start: "20:00", stop: "06:00" } } })
And that will do it; your balancer document will now have an activeWindow that will start it at 8PM and stop it at
6AM. You should now be able to start and stop the balancer, confirm its state and when it was last running, and finally
set a time window in which the balancer is active.
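If you want to confirm that the window took effect (or remove it later), you can inspect and edit the same settings document. A sketch of that exchange, to be run from a mongos shell against a live cluster (output will vary with your deployment):

```javascript
// Confirm the activeWindow was stored on the balancer settings document:
db.getSiblingDB("config").settings.find({ _id: "balancer" })

// To remove the window again and let the balancer run at any time:
db.getSiblingDB("config").settings.update(
    { _id: "balancer" },
    { $unset: { activeWindow: true } }
)
```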
Hashed Shard Keys
Earlier we discussed how important it is to pick the correct shard key. If you pick the wrong shard key, you can cause
all kinds of performance problems. Take, for example, sharding on _id , which is an ever-increasing value. Each insert
you make will be sent to the shard in your set that currently holds the highest _id value. As each new insert is the
“largest” value that has been inserted, you will always be inserting data to the same place. This means you will have
one “hot” shard in your cluster that is receiving all inserts and has all documents being migrated from it to the other
shards—not very efficient.
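To see why, here is a small standalone sketch in plain JavaScript (the chunk layout is invented for illustration; this is not MongoDB code) of how range-based routing sends every insert of an ever-increasing key to the same top chunk:

```javascript
// Hypothetical chunk layout: with range-based sharding on an
// ever-increasing key, every new insert lands in the chunk covering
// the highest range, i.e. always the same "hot" chunk/shard.
const chunks = [
  { min: 0,    max: 1000,     shard: "shard0" },
  { min: 1000, max: 2000,     shard: "shard1" },
  { min: 2000, max: Infinity, shard: "shard2" }, // top chunk: [2000, MaxKey)
];

function shardFor(key) {
  return chunks.find(c => key >= c.min && key < c.max).shard;
}

// Simulate 100 inserts with a key that keeps increasing past existing data.
const hits = {};
for (let key = 2500; key < 2600; key++) {
  const s = shardFor(key);
  hits[s] = (hits[s] || 0) + 1;
}
console.log(hits); // every insert went to shard2 — the hot shard
```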
To help solve this problem, MongoDB 2.4 introduced a new feature: hashed shard keys! A hashed shard
key will create a hash for each of the values on a given field and then use these hashes to perform the chunking and
sharding operations. This allows you to take an increasing value such as an _id field and generate a hash for each
given _id value, which will give randomness to values. Adding this level of randomness should normally allow you
to distribute writes evenly to all shards. The cost, however, is that you'll have random reads as well, which can be a
performance penalty if you wish to perform operations over a range of documents. For this reason hashed sharding
may be inefficient when compared with a user-selected shard key under certain workloads.
Note Because of the way hashing is implemented, there are some limitations when you shard on floating-point
(decimal) numbers: values such as 2.3, 2.4, and 2.9 will become the same hashed value.
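That collision comes from a truncation step: MongoDB converts floating-point values to 64-bit integers before hashing them, so the fractional part is simply dropped. A minimal sketch of that step in plain JavaScript (not MongoDB's actual code):

```javascript
// The integer-truncation step that makes 2.3, 2.4, and 2.9 hash
// identically: they all truncate to the same hash input, 2.
function toHashInput(value) {
  return Math.trunc(value);
}

console.log(toHashInput(2.3), toHashInput(2.4), toHashInput(2.9)); // 2 2 2
console.log(toHashInput(3.1)); // 3: a different hash input
```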
So, to create a hashed shard key, we simply run shardCollection and specify the "hashed" index type:
sh.shardCollection( "testdb.testhashed", { _id: "hashed" } )
And that's it! You have now created a hashed shard key, which will hash the incoming _id values in order to
distribute your data in a more “random” fashion. Now, with all this in mind, some of you may be asking: why not
always use a hashed shard key?
Good question; the answer is that sharding is just one of “those” dark arts. The optimum shard key is one that
allows your writes to be distributed well over a number of shards, so that the writes are effectively parallel. It is also a
key that lets you group documents so that queries are routed to only one or a limited number of shards, and it must
allow you to make more effective use of the indexes held on the individual shards. All of those factors will be
determined by your use case, what you are storing, and how you are retrieving it.
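As an illustration (the collection and field names here are invented), a compound shard key can strike that balance: the first field groups one customer's documents so queries for that customer target few shards, while the second field splits a large customer's data across chunks. Run from a mongos shell against a live cluster:

```javascript
// Hypothetical compound shard key; field names are examples only.
sh.shardCollection( "testdb.orders", { customerId: 1, orderDate: 1 } )
```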