[Diagram: invoice.xml -> MD5 hash function -> d41d8cd98f00b204e9800998ecf8427e; code shown: let $hash := hash($invoice, 'md5')]
Figure 2.6 Sample hashing process. An input document such as a business invoice is sent through a hashing function. The result of the hashing function is a string that, for practical purposes, is unique to the original document. A change of a single byte in the input will return a different hash string. A hash can be used to see if a document has changed or if it's already located in a RAM cache.
document is the same as one already in your cache. Knowing this information prevents you from making unnecessary calls to disk for information and keeps your databases running fast.
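The cache check described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular database: the class name `HashKeyedCache`, the helper `fake_disk_read`, the sample query, and the choice of SHA-256 as the key function are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict


class HashKeyedCache:
    """In-memory cache whose keys are hashes of the requested content."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.disk_reads = 0  # counts how often we had to go to "disk"

    @staticmethod
    def key_for(content: str) -> str:
        # Identical content always hashes to the same key.
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def get(self, query: str, load_from_disk: Callable[[str], str]) -> str:
        key = self.key_for(query)
        if key not in self._store:
            # Cache miss: pay for the slow read once and remember the result.
            self.disk_reads += 1
            self._store[key] = load_from_disk(query)
        return self._store[key]


def fake_disk_read(query: str) -> str:
    # Hypothetical stand-in for an expensive disk or database read.
    return f"rows for: {query}"


cache = HashKeyedCache()
first = cache.get("SELECT * FROM invoices WHERE id = 1001", fake_disk_read)
second = cache.get("SELECT * FROM invoices WHERE id = 1001", fake_disk_read)
print(first == second)   # True
print(cache.disk_reads)  # 1
```

The second request finds its hash already in the cache, so the slow read runs only once.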
A hash string (also known as a checksum or hash) is a sequence of letters and digits that a hash function calculates by looking at each byte of a document. For practical purposes, the hash string uniquely identifies each document and can be used to determine whether the document you're presented with is the same document you already have on hand. If there's any difference between two documents (even a single byte), the resulting hash will almost certainly be different. Since the 1990s, hash strings have been created using standardized algorithms such as MD5, SHA-1, SHA-256, and RIPEMD-160. Figure 2.6 illustrates a typical hashing process.
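To make the single-byte property concrete, here is a small sketch using Python's standard `hashlib` module. The invoice content and the helper name `hash_document` are invented for illustration; only the `hashlib` calls are part of the standard library.

```python
import hashlib


def hash_document(data: bytes, algorithm: str = "md5") -> str:
    """Return the hex digest of `data` using the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


# Hypothetical invoice content, used only for illustration.
invoice = b"<invoice><id>1001</id><total>250.00</total></invoice>"
# The same document with a single byte changed (250.00 -> 250.01).
altered = b"<invoice><id>1001</id><total>250.01</total></invoice>"

original_hash = hash_document(invoice)
altered_hash = hash_document(altered)

print(original_hash)
print(altered_hash)
print(original_hash == altered_hash)  # False

# The same helper works for longer digests such as SHA-256.
print(hash_document(invoice, "sha256"))
```

Incidentally, hashing an empty byte string with MD5 yields d41d8cd98f00b204e9800998ecf8427e, which is the value shown in Figure 2.6.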
Hash values can be created for simple queries or complex JSON or XML documents. Once you have your hash value, you can use it to make sure that the information you're sending is the same information others are receiving. Consistent hashing occurs when two different processes running on different nodes in your network create the same hash for the same object. Consistent hashing confirms that the information in the document hasn't been altered and allows you to
Hash collisions
There's an extremely small chance that two different documents could generate the same hash value, resulting in a hash collision. The likelihood of this occurring is related to the length of the hash value and how many documents you're storing. The longer the hash, the lower the odds of a collision. As you add more documents, the chance of a collision increases. Many systems use the MD5 hash algorithm, which generates a 128-bit hash string. A 128-bit hash can generate approximately 3.4 × 10^38 possible outputs. That means that if you want to keep the odds of a collision low, for example odds of under one in 10^18, the birthday approximation says you want to limit the number of documents you keep to under about 2.6 × 10^10, or roughly 26 billion documents.
For most applications that use hashing, accidental hash collisions aren't a concern. But there are situations where avoiding hash collisions is important. Systems that use hashes for security verification, like government or high-security systems, require hash values to be greater than 128 bits. In these situations, algorithms that generate a hash value greater than 128 bits, like SHA-1, SHA-256, SHA-384, or SHA-512, are preferred.
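The relationship between hash length, document count, and accidental collision odds can be estimated with the standard birthday approximation, p ≈ n² / (2 · 2^b), where n is the number of documents and b is the hash length in bits. The sketch below assumes hash outputs are uniformly distributed and that p is small; the function names are chosen for this example only.

```python
import math


def collision_probability(num_docs: float, hash_bits: int) -> float:
    """Approximate odds of at least one accidental collision.

    Uses the birthday approximation n^2 / (2 * 2^b), which is only
    accurate while the resulting probability is small.
    """
    return num_docs ** 2 / (2 * 2 ** hash_bits)


def max_documents(target_probability: float, hash_bits: int) -> float:
    """Largest document count that keeps collision odds under the target."""
    return math.sqrt(2 * 2 ** hash_bits * target_probability)


# Number of distinct 128-bit hash values.
print(f"{2 ** 128:.2e}")
# Document limit for odds under one in 10^18 with a 128-bit hash.
print(f"{max_documents(1e-18, 128):.2e}")
# The same limit for a 256-bit hash such as SHA-256.
print(f"{max_documents(1e-18, 256):.2e}")
# Collision odds when storing 10^13 documents under a 128-bit hash.
print(f"{collision_probability(1e13, 128):.2e}")
```

This covers accidental collisions only. Deliberately constructed collisions are a separate concern and depend on the algorithm, not just the hash length.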