Using NoSQL to manage big data - Making Sense of NoSQL

Databases Reference

In-Depth Information

efficient to send the query to each node than it is to transfer large datasets to a central

processor. This may seem obvious, but it's amazing how many traditional databases

still can't distribute queries and aggregate query results.

This simple rule helps you understand how NoSQL databases can have dramatic

performance advantages over systems that weren't designed to distribute queries to

the data nodes. Consider an RDBMS that has tables distributed over two different

nodes. In order for the SQL query to work, information about rows on one table must

all be moved across the network to the other node. Larger tables result in more data

movement, which results in slower queries. Think of all the steps involved. The tables

can be extracted, serialized, sent through the network interface, transmitted over net-

works, reassembled, and then compared on the server with the SQL query.

Keeping all the data within each data node in the form of logical documents

means that only the query itself and the final result need to be moved over a network.

This keeps your big data queries fast.

6.8.2

Using hash rings to evenly distribute data on a cluster

One of the most challenging problems with distributed databases is figuring out a

consistent way of assigning a document to a processing node. Using a hash ring tech-

nique to evenly distribute big data loads over many servers with a randomly generated

40-character key is a good way to evenly distribute a network load.

Hash rings are common in big data solutions because they consistently determine

how to assign a piece of data to a specific processor. Hash rings take the leading bits of

a document's hash value and use this to determine which node the document should

be assigned. This allows any node in a cluster to know what node the data lives on and

how to adapt to new assignment methods as your data grows. Partitioning keys into

ranges and assigning different key ranges to specific nodes is known as keyspace man-

agement . Most NoSQL systems, including MapReduce, use keyspace concepts to man-

age distributed computing problems.

In chapters 3 and 4 we reviewed the concept of hashing, consistent hashing, and

key-value stores. A hash ring uses these same concepts to assign an item of data to a

specific node in a NoSQL database cluster. Figure 6.11 is a diagram of a sample hash

ring with four nodes.

As you can see from the figure, each input will be assigned to a node based on the

40-character random key. One or more nodes in your cluster will be responsible for

storing this key-to-node mapping algorithm. As your database grows, you'll update the

algorithm so that each new node will also be assigned some range of key values. The

algorithm also needs to replicate items with these ranges from the old nodes to the

new nodes.

The concept of a hash ring can also be extended to include the requirement that

an item must be stored on multiple nodes. When a new item is created, the hash ring

rules might indicate both a primary and a secondary copy of where an item is stored.

If the node that contains the primary fails, the system can look up the node where the

secondary item is stored.

Making Sense of NoSQL

Search WWH ::

Custom Search

Home