will be if caching is added. This assumes an infinite cache with no expiration. If even
under these theoretically perfect conditions the cache hit ratio is low, we know that a cache
will not help. However, if there are duplicate queries, the cumulative size of the responses
to those queries will give a good estimate for sizing the cache.
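The estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name infinite_cache_estimate and the log layout (a list of query and response-size pairs) are assumptions, not something the text prescribes.

```python
from collections import Counter


def infinite_cache_estimate(log):
    """Estimate the best-case hit ratio and a cache size from a query log.

    `log` is a list of (query, response_size_in_bytes) pairs in arrival
    order. This record layout is an assumption made for the sketch.
    """
    counts = Counter(query for query, _ in log)
    sizes = {query: size for query, size in log}  # last size seen per query
    total = len(log)

    # With an infinite cache that never expires, only the first request
    # for each distinct query misses; every repeat is a hit.
    hits = total - len(counts)
    hit_ratio = hits / total if total else 0.0

    # Cumulative response size of the queries that occur more than once.
    duplicate_bytes = sum(sizes[q] for q, n in counts.items() if n > 1)
    return hit_ratio, duplicate_bytes


if __name__ == "__main__":
    sample = [("a", 100), ("b", 200), ("a", 100), ("c", 50), ("a", 100)]
    ratio, size = infinite_cache_estimate(sample)
    print(f"best-case hit ratio: {ratio:.2f}, bytes worth caching: {size}")
```

A low ratio here suggests a cache would not pay off, since no real cache can beat the infinite one on the same log.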
The problem with measurements from live or benchmark systems is that they require the
system to exist. When designing a system it is important to be able to make a reasonable
prediction of what cache size will be required.
We can improve our estimate by using a cache simulator. These tools can be used to
provide “what if” analysis to determine the minimum cache size.
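As a rough idea of what such a simulator does, the sketch below replays a request trace against a cache of a given capacity and reports the hit ratio. It assumes a least-recently-used (LRU) eviction policy and measures capacity in entries rather than bytes; both are choices made for the example, and the names simulate_lru and smallest_capacity are hypothetical.

```python
from collections import OrderedDict


def simulate_lru(requests, capacity):
    """Replay `requests` (a list of keys) against an LRU cache.

    Returns the fraction of requests that were cache hits.
    """
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests) if requests else 0.0


def smallest_capacity(requests, target_ratio, max_capacity):
    """Return the smallest capacity that reaches `target_ratio`, or None."""
    for capacity in range(1, max_capacity + 1):
        if simulate_lru(requests, capacity) >= target_ratio:
            return capacity
    return None


if __name__ == "__main__":
    trace = ["a", "b", "a", "c", "a", "b"]
    for size in (1, 2, 3):
        print(size, round(simulate_lru(trace, size), 3))
    print("smallest capacity for 50% hits:", smallest_capacity(trace, 0.5, 5))
```

Running the same trace at several capacities is the "what if" analysis: the hit ratio usually climbs quickly and then flattens, and the point where it flattens is a reasonable minimum size.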
Once the cache is in place, the cache hit ratio should be monitored. The cache size can
be reevaluated periodically, increasing it to improve performance as needed.
5.5 Data Sharding
Sharding is a way to segment a database (z-axis) that is flexible, scalable, and resilient. It
divides the database based on the hash value of the database keys.
A hash function is an algorithm that maps data of varying lengths to a fixed-length
value. The result is considered probabilistically unique. For example, the MD5 algorithm
returns a 128-bit number for any input. Because there are
340,282,366,920,938,463,463,374,607,431,768,211,456 possible combinations, the chance
of two inputs producing the same hash is very small. Even a small change in the data
creates a big change in the hash. The MD5 hash of “Jennifer” is
e1f6a14cd07069692017b53a8ae881f6 but the MD5 hash of “Gennifer” is
1e49bbe95b90646dca5c46a8d8368dab.
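The fixed-length property is easy to observe with Python's standard hashlib module. The helper name md5_hex and the choice of UTF-8 encoding are assumptions made for this sketch; the digest depends on the exact bytes hashed, so a different encoding or a trailing newline gives a different value.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of `text` (encoded as UTF-8) as 32 hex digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    for name in ("Jennifer", "Gennifer"):
        digest = md5_hex(name)
        # 32 hex digits * 4 bits each = 128 bits, whatever the input length.
        print(name, digest, len(digest) * 4, "bits")
```

The two names differ in one letter, yet the printed digests share no obvious pattern, which is the behavior sharding relies on to spread similar keys apart.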
To divide a database into two shards, generate the hash of the key and store keys with
even hashes in one database and keys with odd hashes in the other database. To divide a
database into four shards, split the database based on the remainder of the key's hash
divided by 4 (i.e., the hash mod 4). Since the remainder will be 0, 1, 2, or 3, this will indicate
which of the four shards will store that key. Because the hash values are randomly
distributed between the shards, each shard will store approximately the same number of keys
automatically. This pattern is called a distributed hash table (DHT) since it distributes the
data over many machines, and uses hashes to determine where the data is stored.
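The shard selection described above amounts to one line of arithmetic. The sketch below assumes MD5 as the hash and string keys; the function name shard_for and the synthetic keys user0 through user9999 are made up for illustration.

```python
import hashlib
from collections import Counter


def shard_for(key, num_shards):
    """Pick a shard as (MD5 hash of key) mod num_shards."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


if __name__ == "__main__":
    # Synthetic keys, used only to show how evenly the keys spread out.
    keys = [f"user{i}" for i in range(10000)]
    per_shard = Counter(shard_for(k, 4) for k in keys)
    for shard in sorted(per_shard):
        print(f"shard {shard}: {per_shard[shard]} keys")
```

With num_shards set to 2 this reduces to the even/odd split, since a hash mod 2 is 0 for even values and 1 for odd ones. Each shard should end up with close to a quarter of the keys in the four-shard run.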