Data sharding
Data sharding[7] is the most common and successful approach for scaling today's very
large MySQL applications. You shard the data by splitting it into smaller pieces, or
shards, and storing them on different nodes.
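The split is usually driven by a partitioning key such as a user ID. As a minimal sketch of this idea, the routing below maps a user ID to one of a fixed set of shard hosts with a simple modulo; the DSNs, shard count, and function name are all hypothetical, and real deployments more often use a directory service or consistent hashing so that shards can be moved or added without rehashing everything:

```python
# Hypothetical shard hosts; in practice these would come from
# configuration or a directory service, not a hardcoded list.
SHARD_DSNS = [
    "mysql://shard0.example.com/blog",
    "mysql://shard1.example.com/blog",
    "mysql://shard2.example.com/blog",
    "mysql://shard3.example.com/blog",
]

def shard_for_user(user_id: int) -> str:
    """Map a user ID to the DSN of the node holding that user's rows."""
    return SHARD_DSNS[user_id % len(SHARD_DSNS)]
```

Every query for that user's data then goes to the one node returned by `shard_for_user()`, which is what keeps each node small enough to manage.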
Sharding works well when combined with some type of functional partitioning. Most
sharded systems also have some “global” data that isn't sharded at all (say, lists of cities,
or login data). This global data is usually stored on a single node, often behind a cache
such as memcached.
In fact, most applications shard only the data that needs sharding—typically, the parts
of the dataset that will grow very large. Suppose you're building a blogging service. If
you expect 10 million users, you might not need to shard the user registration
information because you might be able to fit all of the users (or the active subset of them)
entirely in memory. If you expect 500 million users, on the other hand, you should
probably shard this data. The user-generated content, such as posts and comments,
will almost certainly require sharding in either case, because these records are much
larger and there are many more of them.
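This selective approach can be expressed as a routing rule: small tables go to the global node, fast-growing tables go to a shard. The sketch below assumes the blogging-service split described above; the node names, table set, and shard count are illustrative, not a prescribed layout:

```python
# Hypothetical node names for the blogging-service example: the
# small "users" table stays global, while the fast-growing tables
# are sharded by user ID.
GLOBAL_NODE = "global-master"
SHARDED_TABLES = {"posts", "comments"}
NUM_SHARDS = 4

def node_for(table: str, user_id: int) -> str:
    """Return the node that stores this table's rows for this user."""
    if table in SHARDED_TABLES:
        return f"shard-{user_id % NUM_SHARDS}"
    return GLOBAL_NODE
```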
Large applications might have several logical datasets that you can shard differently.
You can store them on different sets of servers, but you don't have to. You can also
shard the same data multiple ways, depending on how you access it. We show an
example of this approach later.
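One way to shard the same data two ways, sketched here under assumed names, is to write each row twice under two different keys. For a private-messaging feature, for instance, a copy sharded by sender serves "messages I sent" and a copy sharded by recipient serves "my inbox," so each query stays on a single shard:

```python
NUM_SHARDS = 4  # hypothetical shard count

def shards_for_message(sender_id: int, recipient_id: int) -> tuple:
    """Return the two shards that must each receive a copy of the
    message: one keyed by sender, one keyed by recipient."""
    return (sender_id % NUM_SHARDS, recipient_id % NUM_SHARDS)
```

The cost of this approach is double the writes and storage, which is often worth it to avoid cross-shard queries on the read path.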
Sharding is dramatically different from the way most applications are designed initially,
and it can be difficult to change an application from a monolithic data store to a sharded
architecture. That's why it's much easier to build an application with a sharded data
store from the start if you anticipate that it will eventually need one.
Most applications that don't build in sharding from the beginning go through stages
as they get larger. For example, you can use replication to scale read queries on your
blogging service until it doesn't work any more. Then you can split the service into
three parts: users, posts, and comments. You can place these on different servers
(functional partitioning), perhaps with a service-oriented architecture, and perform the joins
in the application. Figure 11-6 shows the evolution from a single server to functional
partitioning.
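"Performing the joins in the application" means fetching from each server separately and stitching the rows together in code rather than in SQL. The sketch below uses in-memory dicts as stand-ins for the posts and users servers; the data and function names are invented for illustration:

```python
# Stand-ins for two functionally partitioned databases: in a real
# deployment these would be queries against separate MySQL servers.
posts_db = [
    {"post_id": 1, "user_id": 10, "title": "Hello"},
    {"post_id": 2, "user_id": 11, "title": "Sharding 101"},
]
users_db = {
    10: {"user_id": 10, "name": "alice"},
    11: {"user_id": 11, "name": "bob"},
}

def recent_posts_with_authors(limit: int = 10) -> list:
    """Join posts to their authors in the application layer."""
    # Step 1: SELECT ... FROM posts LIMIT n  (posts server)
    posts = posts_db[:limit]
    # Step 2: SELECT ... WHERE user_id IN (...)  (users server)
    user_ids = {p["user_id"] for p in posts}
    users = {uid: users_db[uid] for uid in user_ids}
    # Step 3: stitch the two result sets together in code
    return [{**p, "author": users[p["user_id"]]["name"]} for p in posts]
```

The batched `IN (...)` lookup in step 2 is the important detail: it keeps the application-side join to two round trips instead of one query per post.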
Finally, you can shard the posts and comments by the user ID, and keep the user
information on a single node. If you keep a master-replica configuration for the global
node and use master-master pairs for the sharded nodes, the final data store might look
like Figure 11-7.
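That final topology can be captured in a small routing table, sketched here with invented host names: writes to global data go to the global master, while writes to sharded data go to the currently active side of the shard's master-master pair:

```python
# Hypothetical final topology: one master-replica pair for the
# global node, master-master pairs for the shards. Only one side
# of each pair takes writes at a time.
TOPOLOGY = {
    "global": {"master": "global-m", "replica": "global-r"},
    "shards": [
        {"active": "shard0-a", "standby": "shard0-b"},
        {"active": "shard1-a", "standby": "shard1-b"},
    ],
}

def write_host(user_id: int = None) -> str:
    """Host that accepts writes: the global master for unsharded
    data, the active half of the pair for a user's shard."""
    if user_id is None:
        return TOPOLOGY["global"]["master"]
    pair = TOPOLOGY["shards"][user_id % len(TOPOLOGY["shards"])]
    return pair["active"]
```

Keeping the pairs master-master means a shard can fail over by flipping `active` and `standby` in this table, without any data reload.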
[7] Sharding is also called "splintering" and "partitioning," but we use the term "sharding" to avoid
confusion. Google calls it sharding, and if it's good enough for Google, it's good enough for us.