efficient to send the query to each node than it is to transfer large datasets to a central
processor. This may seem obvious, but it's amazing how many traditional databases
still can't distribute queries and aggregate query results.
This simple rule helps you understand how NoSQL databases can have dramatic
performance advantages over systems that weren't designed to distribute queries to
the data nodes. Consider an RDBMS that has tables distributed over two different
nodes. For a SQL query that spans both tables to work, the rows of one table must
all be moved across the network to the other node. Larger tables result in more data
movement, which results in slower queries. Think of all the steps involved. The tables
must be extracted, serialized, sent through the network interface, transmitted over
networks, reassembled, and then compared on the server with the SQL query.
Keeping all the data within each data node in the form of logical documents
means that only the query itself and the final result need to be moved over a network.
This keeps your big data queries fast.
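
The idea of moving the query to the data can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database implements it: the `NODES` dictionary, the document fields, and the function names are all hypothetical, and the "network" is simulated by ordinary function calls. The point is that each node returns only a small partial result, which the caller then merges.

```python
from collections import Counter

# Hypothetical in-memory "nodes": each one holds its own documents locally.
NODES = {
    "node-a": [{"city": "Oslo", "amount": 10}, {"city": "Rome", "amount": 5}],
    "node-b": [{"city": "Oslo", "amount": 7}],
    "node-c": [{"city": "Rome", "amount": 3}, {"city": "Lima", "amount": 4}],
}


def run_on_node(documents, query):
    """Runs where the data lives and returns only a small partial result."""
    partial = Counter()
    for doc in documents:
        if query(doc):
            partial[doc["city"]] += doc["amount"]
    return partial


def distributed_total(nodes, query):
    """Send the query to every node, then merge the partial results."""
    total = Counter()
    for documents in nodes.values():
        total.update(run_on_node(documents, query))
    return dict(total)


# Only the query and the per-city totals cross the (simulated) network.
totals = distributed_total(NODES, lambda doc: doc["amount"] > 3)
print(totals)
```

In a real cluster the calls to `run_on_node` would run in parallel on separate machines, but the shape of the traffic is the same: a small query goes out and small aggregates come back.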
6.8.2 Using hash rings to evenly distribute data on a cluster
One of the most challenging problems with distributed databases is figuring out a
consistent way of assigning a document to a processing node. A hash ring that places
each document by a 40-character hash key, which is effectively random, is a good way
to spread big data loads evenly over many servers.
Hash rings are common in big data solutions because they consistently determine
how to assign a piece of data to a specific processor. Hash rings take the leading bits of
a document's hash value and use them to determine which node the document should
be assigned to. This allows any node in a cluster to know which node the data lives on and
how to adapt to new assignments as your data grows. Partitioning keys into
ranges and assigning different key ranges to specific nodes is known as keyspace
management. Most NoSQL systems, as well as MapReduce frameworks, use keyspace concepts to
manage distributed computing problems.
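
As a rough sketch of the leading-bits approach, the following Python assigns a document to one of four nodes. Several details are assumptions made for illustration: SHA-1 is used because its hex digest happens to be 40 characters long, `NODE_BITS` is fixed at 2 so that there are exactly four key ranges, and the node names are made up.

```python
import hashlib

NODE_BITS = 2  # assumption: 2 leading bits give 4 key ranges, one per node
NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical node names


def hash_key(document_id: str) -> str:
    """Return a 40-character hex key (SHA-1 chosen only as an example hash)."""
    return hashlib.sha1(document_id.encode("utf-8")).hexdigest()


def node_for_key(key: str) -> str:
    """Use the leading bits of the 160-bit key to pick the owning node."""
    value = int(key, 16)
    leading = value >> (160 - NODE_BITS)  # keep only the top NODE_BITS bits
    return NODES[leading]


def node_for_document(document_id: str) -> str:
    """Any node can compute this locally, with no lookup service involved."""
    return node_for_key(hash_key(document_id))


print(node_for_document("invoice-1001"))
```

Because the mapping depends only on the key, every node in the cluster computes the same answer for the same document. Each value of the leading bits corresponds to one contiguous range of the keyspace.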
In chapters 3 and 4 we reviewed the concepts of hashing, consistent hashing, and
key-value stores. A hash ring uses these same concepts to assign an item of data to a
specific node in a NoSQL database cluster. Figure 6.11 is a diagram of a sample hash
ring with four nodes.
As you can see from the figure, each input will be assigned to a node based on the
40-character random key. One or more nodes in your cluster will be responsible for
storing this key-to-node mapping algorithm. As your database grows, you'll update the
algorithm so that each new node will also be assigned some range of key values. The
algorithm also needs to replicate items in these ranges from the old nodes to the
new nodes.
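
The effect of adding a node can be illustrated with a small consistent-hashing ring. This is a sketch under stated assumptions rather than the algorithm of any specific product: the `HashRing` class, its method names, and the node and document names are hypothetical, SHA-1 is used only to get 40-character keys, and each node occupies a single position on the ring.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a node name or document key to a position on the ring."""
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        self._positions = []  # sorted ring positions of all nodes
        self._owners = {}     # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owners[pos] = node

    def node_for(self, key):
        """The owner is the first node clockwise from the key's position."""
        pos = ring_position(key)
        index = bisect.bisect_right(self._positions, pos) % len(self._positions)
        return self._owners[self._positions[index]]


keys = [f"doc-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-e")  # the cluster grows by one node
after = {k: ring.node_for(k) for k in keys}

# These are the items that must be copied from an old node to the new one.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} documents need to move to node-e")
```

The only documents whose owner changes are those in the key range taken over by the new node. Everything else stays where it was, which is what keeps growth manageable.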
The concept of a hash ring can also be extended to include the requirement that
an item must be stored on multiple nodes. When a new item is created, the hash ring
rules might indicate both a primary and a secondary node for storing the item.
If the node that contains the primary copy fails, the system can look up the node where the
secondary copy is stored.
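
One common way to derive a primary and a secondary location is to take the next two nodes clockwise from the key's position on the ring. The sketch below assumes that rule. The function names, the node names, and the use of SHA-1 are illustrative choices, not something the text prescribes.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a node name or document key to a position on the ring."""
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)


NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
RING = sorted((ring_position(n), n) for n in NODES)
POSITIONS = [pos for pos, _ in RING]


def replicas_for(key: str, copies: int = 2):
    """Return [primary, secondary, ...]: the next `copies` nodes clockwise."""
    start = bisect.bisect_right(POSITIONS, ring_position(key))
    return [RING[(start + i) % len(RING)][1] for i in range(copies)]


def node_to_read(key: str, failed: set):
    """Prefer the primary; fall back to the secondary if the primary is down."""
    for node in replicas_for(key):
        if node not in failed:
            return node
    raise RuntimeError("all replicas for this key are unavailable")


primary, secondary = replicas_for("doc-42")
print(primary, secondary)
print(node_to_read("doc-42", failed={primary}))  # falls back to the secondary
```

Because the secondary is computed from the same ring, any node can work out where the backup copy lives without consulting a separate directory.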