Databases Reference
In-Depth Information
queries the helper application, which is often a web service or a worker service such as
Gearman.
Cross-shard queries can also benefit from summary tables. You can build them by
traversing all the shards and storing the results redundantly on each shard when they're
complete. If duplicating the data on each shard would be too wasteful, you can con-
solidate the summary tables onto another data store, so they're stored only once.
Nonsharded data often lives in the global node, with heavy caching to shield it from
the load.
Some applications use essentially random sharding when perfectly even data distribu-
tion is important, or when there is no good partitioning key. A distributed search ap-
plication is a good example. In this case, cross-shard queries and aggregation are the
norm, not the exception.
Querying across shards isn't the only thing that's harder with sharding. Maintaining
data consistency is also difficult. Foreign keys won't work across shards, so the normal
solution is to check referential integrity as needed in the application, or use foreign keys
within a shard, because internal consistency within a shard might be the most important
thing. It's possible to use XA transactions, but this is uncommon in practice because
of the overhead.
You can also design cleanup processes that run intermittently. For example, if a user's
book club account expires, you don't have to remove it immediately. You can write a
periodic job to remove the user's comments from the per-book shard, and you can build
a checker script that runs periodically and makes sure the data is consistent across the
shards.
Allocating data, shards, and nodes
Shards and nodes don't have to have a one-to-one relationship. It's often a good idea
to make a shard's size much smaller than a node's capacity, so you can store multiple
shards on a single node.
Keeping each shard small helps keep the data manageable. It makes it easier to do
database backups and recovery, and if the tables are small, it can ease jobs such as
schema changes. For example, suppose you have a 100 GB table that you can either
store as it is or split into 100 shards of 1 GB tables, which you would store on a single
node. Now suppose you want to add an index to the table(s). This would take much
longer on a 100 GB shard than it would on all the 1 GB shards combined, because the
1 GB shards fit completely in memory. You also might need to make the data unavail-
able while ALTER TABLE is running, and blocking 1 GB of data is much better than
blocking 100 GB.
 
Search WWH ::




Custom Search