Scaling MySQL - High Performance MySQL

Databases Reference

In-Depth Information

equally populated in total, and you can choose them with an affinity that helps avoid

cross-shard queries.

Multiple partitioning keys

Complicated data models make data sharding more difficult. Many applications have

more than one partitioning key, especially if there are two or more important “dimen-

sions” in the data. In other words, the application might need to see an efficient, co-

herent view of the data from different angles. This means you might need to store at

least some data twice within the system.

For example, you might need to shard your blogging application's data by both the

user ID and the post ID, because these are two common ways the application looks at

the data. Think of it this way: you frequently want to see all posts for a user, and all

comments for a post. But sharding by user doesn't help you find comments for a post,

and sharding by post doesn't help you find posts for a user. If you need both types of

queries to touch only a single shard, you'll have to shard both ways.

Just because you need multiple partitioning keys doesn't mean you'll need to design

two completely redundant data stores. Let's look at another example: a social net-

working topic club website, where the site's users can comment on books. The website

can display all comments for a all book, as well as all books a user has read and com-

mented on.

You might build one sharded data store for the user data and another for the book data.

Comments have both a user ID and a post ID, so they cross the boundaries between

shards. Instead of completely duplicating comments, you can store the comments with

the user data. Then you can store just a comment's headline and ID with the book data.

This might be enough to render most views of a book's comments without accessing

both data stores, and if you need to display the complete comment text, you can retrieve

it from the user data store.

Querying across shards

Most sharded applications have at least some queries that need to aggregate or join

data from multiple shards. For example, if the book club site shows the most popular

or active users, it must by definition access every shard. Making such queries work well

is the most difficult part of implementing data sharding, because what the application

sees as a single query needs to be split up and executed in parallel as many queries, one

per shard. A good database abstraction layer can help ease the pain, but even then such

queries are so much slower and more expensive than in-shard queries that aggressive

caching is usually necessary as well.

Some languages, such as PHP, don't have good support for executing multiple queries

in parallel. A common way to work around this is to build a helper application, often

in C or Java, to execute the queries and aggregate the results. The PHP application then

Search WWH ::

Custom Search

Home