Design Patterns for Scaling - The Practice of Cloud System Administration

will be if caching is added. This assumes an infinite cache with no expiration. If even

under these theoretically perfect conditions the cache hit ratio is low, we know that a cache

will not help. However, if there are duplicate queries, the cumulative size of the responses

to those queries will give a good estimate for sizing the cache.
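
The estimate described above can be sketched in a few lines. This is a minimal illustration, not a method from the text: it assumes the query log is available as a sequence of (query, response size in bytes) pairs, and the function name `infinite_cache_estimate` is hypothetical.

```python
def infinite_cache_estimate(log):
    """Return (hit_ratio, cache_bytes) for an infinite cache with no expiration.

    `log` is an iterable of (query, response_size_in_bytes) pairs. That
    format is an assumption made for this example.
    """
    seen = {}         # query -> response size, one entry per distinct query
    repeated = set()  # queries that were requested more than once
    hits = 0
    total = 0
    for query, size in log:
        total += 1
        if query in seen:
            # A repeat of an earlier query would be served from the cache.
            hits += 1
            repeated.add(query)
        else:
            seen[query] = size
    hit_ratio = hits / total if total else 0.0
    # Only responses to duplicated queries benefit from being cached,
    # so their cumulative size is the sizing estimate.
    cache_bytes = sum(seen[q] for q in repeated)
    return hit_ratio, cache_bytes


if __name__ == "__main__":
    sample = [("a", 100), ("b", 50), ("a", 100), ("c", 10), ("a", 100)]
    print(infinite_cache_estimate(sample))
```

A low `hit_ratio` here suggests a cache is unlikely to help, since no real cache can do better than an infinite one.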

The problem with measurements from live or benchmark systems is that they require the

system to exist. When designing a system it is important to be able to make a reasonable

prediction of what cache size will be required.

We can improve our estimate by using a cache simulator. These tools can be used to

provide “what if” analysis to determine the minimum cache size.
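
As a rough illustration of such a "what if" analysis, the sketch below replays a query trace against caches of different sizes. It assumes a least-recently-used (LRU) eviction policy and a trace of (query, response size in bytes) pairs. Neither is specified in the text, and the function names are hypothetical.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity_bytes):
    """Replay `trace` through an LRU cache and return the hit ratio.

    `trace` is a list of (query, response_size_in_bytes) pairs. The LRU
    policy is an assumption; a real simulator may model other policies.
    """
    cache = OrderedDict()  # query -> size, oldest entry first
    used = 0
    hits = 0
    total = 0
    for query, size in trace:
        total += 1
        if query in cache:
            hits += 1
            cache.move_to_end(query)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # response can never fit, so it is not cached
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)  # evict the oldest
            used -= evicted_size
        cache[query] = size
        used += size
    return hits / total if total else 0.0


def smallest_cache_for(trace, target_ratio, candidate_sizes):
    """Return the smallest candidate size meeting the target, or None."""
    for capacity in sorted(candidate_sizes):
        if simulate_lru(trace, capacity) >= target_ratio:
            return capacity
    return None


if __name__ == "__main__":
    trace = [("a", 10), ("b", 10), ("a", 10), ("b", 10)]
    for capacity in (10, 20, 40):
        print(capacity, simulate_lru(trace, capacity))
```

The trace can come from a design-time workload model as well as from production logs, which is what makes the approach usable before the system exists.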

Once the cache is in place, the cache hit ratio should be monitored. The cache size can

be reevaluated periodically, increasing it to improve performance as needed.

5.5 Data Sharding

Sharding is a way to segment a database (z-axis) that is flexible, scalable, and resilient. It

divides the database based on the hash value of the database keys.

A hash function is an algorithm that maps data of varying lengths to a fixed-length

value. The result is considered probabilistically unique. For example, the MD5 algorithm

returns a 128-bit number for any input. Because there are

340,282,366,920,938,463,463,374,607,431,768,211,456 possible combinations, the chance

of two inputs producing the same hash is very small. Even a small change in the data

creates a big change in the hash. The MD5 hash of “Jennifer” is

e1f6a14cd07069692017b53a8ae881f6 but the MD5 hash of “Gennifer” is

1e49bbe95b90646dca5c46a8d8368dab.
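
These properties can be checked with Python's standard `hashlib` module. The sketch below assumes the names are encoded as UTF-8 before hashing, since the text does not state an encoding. The helper name `md5_hex` is hypothetical.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of `text` as 32 hexadecimal characters.

    UTF-8 encoding is an assumption made for this example.
    """
    return hashlib.md5(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first = md5_hex("Jennifer")
    second = md5_hex("Gennifer")
    # Count the hex positions at which the two digests differ.
    differing = sum(1 for x, y in zip(first, second) if x != y)
    print(first)
    print(second)
    print("hex digits that differ:", differing, "of", len(first))
    # 128 bits gives 2**128 possible digests.
    print("possible digests:", 2 ** 128)
```

Each digest is 32 hexadecimal characters, or 128 bits, whatever the length of the input.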

To divide a database into two shards, generate the hash of the key and store keys with

even hashes in one database and keys with odd hashes in the other database. To divide a

database into four shards, split the database based on the remainder of the key's hash

divided by 4 (i.e., the hash mod 4). Since the remainder will be 0, 1, 2, or 3, this will indicate

which of the four shards will store that key. Because the hash values are randomly

distributed between the shards, each shard will store approximately the same number of keys

automatically. This pattern is called a distributed hash table (DHT) since it distributes the

data over many machines, and uses hashes to determine where the data is stored.
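
The shard-selection rule above can be sketched as follows. This is an illustration only: it assumes string keys encoded as UTF-8 and MD5 as the hash, and the function name `shard_for` and the `user…` keys are made up for the example.

```python
import hashlib
from collections import Counter


def shard_for(key, num_shards):
    """Return the shard index (0 .. num_shards - 1) for `key`.

    The key's MD5 digest is read as a 128-bit integer and reduced
    modulo the number of shards.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


if __name__ == "__main__":
    # With two shards this is the even/odd split; with four, hash mod 4.
    print(shard_for("Jennifer", 2), shard_for("Jennifer", 4))

    # Hypothetical keys, used only to show how evenly they spread.
    counts = Counter(shard_for(f"user{i}", 4) for i in range(10000))
    for shard in sorted(counts):
        print("shard", shard, "holds", counts[shard], "keys")
```

Because the shard index depends only on the key and the shard count, any client can locate a key without consulting a central directory.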
