Database Reference
In-Depth Information
per user document, the collection is roughly 1
GB
and can easily be served by a single
machine. But the
spreadsheets
collection is a different story. If you assume that each
user owns an average of 50 spreadsheets, and that each averages 50
KB
in size, then
you're talking about a 1
TB
spreadsheets
collection. If this is a highly active applica-
tion, then you'll want to keep the data in
RAM
. To keep this data in
RAM
and distribute
reads and writes, you must shard the collection. That's where sharding comes into play.
Sharding a collection
MongoDB's sharding is range-based. This means that every document in a sharded
collection must fall within some range of values for a given key. MongoDB uses a so-
called
shard key
to place each document within one of these ranges.
3
You can under-
stand this better by looking at a sample document from the theoretical spreadsheet
management application:
{
_id: ObjectId("4d6e9b89b600c2c196442c21")
filename: "spreadsheet-1",
updated_at: ISODate("2011-03-02T19:22:54.845Z"),
username: "banks",
data: "raw document data"
}
When you shard this collection, you must declare one or more of these fields as the
shard key. If you choose
_id
, then documents will be distributed based on ranges of
object
ID
s. But for reasons that will become clear later, you're going to declare a com-
pound shard key on
username
and
_id
; therefore, each range will usually represent
some series of user names.
You're now in a position to understand the concept of a
chunk
. A chunk is a contig-
uous range of shard key values located on a single shard. As an example, you can
imagine the
docs
collection being divided across two shards, A and B, into the chunks
you see in table 9.1. Each chunk's range is marked by a start and end value.
A cursory glance at this table reveals one of the main, sometimes counterintuitive,
properties of chunks: that although each individual chunk represents a contiguous
range of data, those ranges can appear on any shard.
A second important point about chunks is
that they're not
physical
but
logical
. In other
words, a chunk doesn't represent a contiguous
series of document on disk. Rather, to say that
the chunk beginning with
harris
and ending with
norris
exists on shard A is simply to say that any
document with a shard key falling within this
range can be found in shard A's
docs
collection.
This implies nothing about the
arrangement
of
those documents within the collection.
Table 9.1
Chunks and shards
Start
End
Shard
-
∞
abbot
B
abbot
dayton
A
dayton
harris
B
harris
norris
A
norris
∞
B
3
Alternate distributed databases might use the terms
partition key
or
distribution key
instead.