Databases Reference
In-Depth Information
To ensure read and write operations to never fail due to temporary node or
network outages, Dynamo handles failures with hinted handoff . For example, when
data needs to be replicated on a node that is currently not alive, the data will be
written to a random node to be stored. This random node receives a hint indicating
that the replica needs to be transferred back to the intended node when it is back
online.
Hinted handoff is sufficient when the failure is transient. There are cases where
a hinted replica becomes unavailable before it can be returned to the intended node.
To address this problem, Dynamo uses an anti-entropy protocol to keep the replicas
synchronized. In this protocol, each node maintains a separate Merkle tree [ 25 ]for
each key range it is responsible for and allows other nodes to verify if a key within a
key range is up to date. The advantage of Merkel tree are twofold. First, each branch
of the tree can be checked independently without having to download the entire tree.
Second, it is easy to compare if two trees are the same by comparing the hash values
at their roots. Therefore, Merkle tree can minimize the amount of data needed to
transfer when checking for replica inconsistencies. In Dynamo, if two nodes share a
key range, they exchange the roots of the Merkel trees. Having identical root values
implies no inconsistency; all the replicas are up-to-date. In the case of different root
values, the nodes keep exchanging hash values of the children recursively until the
leave nodes of the trees are reached. At that point we can identify the replica that is
stale and perform synchronization accordingly.
1.2
Google's BigTable
BigTable [ 7 ] is a schema-free distributed high performance database system built
on Google's technologies such as Google File System, Chubby Lock Service [ 4 ],
and Sorted String Table (SSTable). BigTable does not provide support for a full
relational model; instead, it allows dynamic control over data layout. BigTable is
richer than key-value and can be used to manage sparse and semi-structured data.
1.2.1
Table Format
A BigTable table is essentially a sparse persistent multidimensional sorted map.
Each cell value in the map is stored as an array of bytes and indexed by a triple
(row-key, column-key, time-stamp) . Every value is associated with a timestamp t i .
Timestamp is a 64 bit integer used to recognize different revisions of a cell value.
BigTable orders these values in decreasing order of timestamp so that the newest
version is always read first.
Row key is a string of up to 64 KB in size. Rows are lexically ordered and
grouped into tablets . When the data is distributed across the servers, it is the tablets
Search WWH ::




Custom Search