recruited from remote locations and a new generation of distributed revision control
systems was needed.
In response to the demands of distributed development, a new class of distributed
revision control systems (DRCSs) emerged. Systems like Git and Mercurial
can store local copies of a revisioned database and quickly sync up to a
master copy when needed. They do this by calculating a hash of each of the revision
objects (directories as well as files) in the system. When remote systems need to be
synced, they compare the hashes, not the individual files, which allows syncing to
occur quickly even on large and deep trees of data.
The data structure used to detect if two trees are the same is called a hash tree or
Merkle tree. Hash trees work by calculating the hash value of each leaf of a tree, and
then using these hash values to create a node object. Node objects can then be hashed
and result in a new hash value for the entire directory. An example of this is shown in
figure 3.13.
[Diagram: a three-level tree. Top: hash of the root node. Middle: hashes of hashes. Bottom: hashes of individual files (docs).]
Figure 3.13 A hash tree, or Merkle tree, is created by calculating the
hash of all of the leaf structures in a tree. Once the leaf structures have
been hashed, all the nodes within a directory combine their hash values
to create a new document that can also be hashed. This “hash of
hashes” becomes the hash of the directory. This hash value can in turn
be used to create a hash of the parent node. In this way you can
compare the hashes of any point in two trees and immediately know if
all of the structures below a particular node are the same.
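The "hash of hashes" idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular revision control system stores its objects: a directory is modeled as a dict, a file as a bytes value, SHA-256 is an arbitrary choice of hash, and the names merkle_hash, project, reordered, and edited are hypothetical.

```python
import hashlib

def merkle_hash(node):
    """Return a hex SHA-256 digest for a file (bytes) or a directory (dict)."""
    if isinstance(node, bytes):
        # Leaf: hash the file contents. The "file" prefix keeps a file from
        # ever hashing the same as a directory node object.
        return hashlib.sha256(b"file\0" + node).hexdigest()
    # Directory: build a node object that lists each child name with its hash,
    # sorted by name so the result does not depend on insertion order.
    entries = [f"{name} {merkle_hash(child)}" for name, child in sorted(node.items())]
    node_object = "\n".join(entries).encode("utf-8")
    # Hashing the node object gives the "hash of hashes" for the directory.
    return hashlib.sha256(b"dir\0" + node_object).hexdigest()

# Hypothetical project tree, a reordered copy, and a copy with one edited file.
project = {"README": b"hello", "src": {"main.py": b"print('hi')", "util.py": b"x = 1"}}
reordered = {"src": {"util.py": b"x = 1", "main.py": b"print('hi')"}, "README": b"hello"}
edited = {"README": b"hello", "src": {"main.py": b"print('hi')", "util.py": b"x = 2"}}

print(merkle_hash(project) == merkle_hash(reordered))  # same content, same root hash
print(merkle_hash(project) == merkle_hash(edited))     # one changed leaf changes the root
```

A change to a single leaf changes the hash of every directory above it, which is why comparing two root hashes is enough to tell whether anything below them differs.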
Hash trees are used in most distributed revision control systems. If you make a copy of
your current project's software and store it on your laptop and head to the North
Woods for a week to write code, when you return you simply reconnect to the network
and merge your changes with all the updates that occurred while you were gone. The
software doesn't need to do a byte-by-byte comparison to figure out what revision to
use. If your system has a directory with the same hash value as the base system, the
software instantly knows they're the same by comparing the hash values.
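The comparison step can be sketched as a walk that descends only into directories whose hashes differ. This is an illustrative sketch, not the algorithm of any specific tool: trees are modeled as nested dicts with bytes leaves, and tree_hash, changed_paths, base, and laptop are hypothetical names. For brevity it recomputes hashes on every call; a real system would typically store each node's hash rather than recompute it.

```python
import hashlib

def tree_hash(node):
    """Hash a file (bytes) or a directory (dict of name -> child)."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"file\0" + node).hexdigest()
    entries = [f"{name} {tree_hash(child)}" for name, child in sorted(node.items())]
    return hashlib.sha256(b"dir\0" + "\n".join(entries).encode("utf-8")).hexdigest()

def changed_paths(local, remote, path=""):
    """Return the paths that differ, skipping any subtree whose hashes match."""
    if tree_hash(local) == tree_hash(remote):
        return []  # identical hash: nothing below this node needs inspection
    if isinstance(local, bytes) or isinstance(remote, bytes):
        return [path or "/"]  # a file changed, or a file replaced a directory
    changed = []
    for name in sorted(set(local) | set(remote)):
        child_path = f"{path}/{name}"
        if name not in local or name not in remote:
            changed.append(child_path)  # present on only one side
        else:
            changed.extend(changed_paths(local[name], remote[name], child_path))
    return changed

# Hypothetical base system and a laptop copy edited while offline.
base = {"docs": {"a.txt": b"A", "b.txt": b"B"}, "src": {"main.py": b"v1"}}
laptop = {"docs": {"a.txt": b"A", "b.txt": b"B"}, "src": {"main.py": b"v2", "new.py": b"n"}}

print(changed_paths(base, laptop))  # ['/src/main.py', '/src/new.py']
```

The docs directory is never opened in this example because its hash matches on both sides; only src is examined file by file.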
The “gone to the North Woods for a week” synchronization scenario is similar to
the problem of what happens when any node in a distributed database is
disconnected from other nodes for a period of time. You can use the same data structures
and algorithms to keep NoSQL databases in sync as in revision control systems.
 