Databases Reference
In-Depth Information
equally populated in total, and you can choose them with an affinity that helps avoid
cross-shard queries.
Multiple partitioning keys
Complicated data models make data sharding more difficult. Many applications have
more than one partitioning key, especially if there are two or more important “dimen-
sions” in the data. In other words, the application might need to see an efficient, co-
herent view of the data from different angles. This means you might need to store at
least some data twice within the system.
For example, you might need to shard your blogging application's data by both the
user ID and the post ID, because these are two common ways the application looks at
the data. Think of it this way: you frequently want to see all posts for a user, and all
comments for a post. But sharding by user doesn't help you find comments for a post,
and sharding by post doesn't help you find posts for a user. If you need both types of
queries to touch only a single shard, you'll have to shard both ways.
Just because you need multiple partitioning keys doesn't mean you'll need to design
two completely redundant data stores. Let's look at another example: a social net-
working topic club website, where the site's users can comment on books. The website
can display all comments for a all book, as well as all books a user has read and com-
mented on.
You might build one sharded data store for the user data and another for the book data.
Comments have both a user ID and a post ID, so they cross the boundaries between
shards. Instead of completely duplicating comments, you can store the comments with
the user data. Then you can store just a comment's headline and ID with the book data.
This might be enough to render most views of a book's comments without accessing
both data stores, and if you need to display the complete comment text, you can retrieve
it from the user data store.
Querying across shards
Most sharded applications have at least some queries that need to aggregate or join
data from multiple shards. For example, if the book club site shows the most popular
or active users, it must by definition access every shard. Making such queries work well
is the most difficult part of implementing data sharding, because what the application
sees as a single query needs to be split up and executed in parallel as many queries, one
per shard. A good database abstraction layer can help ease the pain, but even then such
queries are so much slower and more expensive than in-shard queries that aggressive
caching is usually necessary as well.
Some languages, such as PHP, don't have good support for executing multiple queries
in parallel. A common way to work around this is to build a helper application, often
in C or Java, to execute the queries and aggregate the results. The PHP application then
 
Search WWH ::




Custom Search