you'll see how each of these architectures works to solve big data problems with different types of data.
Of the architectural data patterns we've discussed so far (row store, key-value store, graph store, document store, and Bigtable store), only two (key-value store and document store) lend themselves to cache-friendliness. Bigtable stores scale well on shared-nothing architectures because their row-column identifiers are similar to key-value stores. But row stores and graph stores aren't cache-friendly, since they don't allow a large BLOB to be referenced by a short key that can be stored in the cache.
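To make the short-key-to-BLOB pattern concrete, here's a minimal sketch of a cache-friendly lookup. The cache class, key scheme, and `load_from_disk` callback are all hypothetical stand-ins for a real distributed cache such as memcached:

```python
import hashlib

class KeyValueCache:
    """A minimal in-memory cache, standing in for memcached or similar."""
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self._store[key] = value

def fetch_document(cache, doc_id, load_from_disk):
    # A short key (here, a hash of the document ID) references a large BLOB,
    # so the cache only needs to index small keys, not scan large values.
    key = hashlib.md5(doc_id.encode()).hexdigest()
    blob = cache.get(key)
    if blob is None:
        blob = load_from_disk(doc_id)  # the expensive path, taken only on a miss
        cache.put(key, blob)
    return blob
```

Because every read and write is addressed by one short key, requests can be spread across a shared-nothing cluster by hashing the key; row stores and graph stores have no such single-key handle for their data.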
For graph traversals to be fast, the entire graph should be in main memory. This is why graph stores work most efficiently when you have enough RAM to hold the graph. If you can't keep your graph in RAM, graph stores will try to swap the data to disk, which will decrease graph query performance by a factor of 1,000. The only way to combat the problem is to move to a shared-memory architecture, where multiple threads all access a large RAM structure without the graph data moving outside of the shared RAM.
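The reason RAM residency matters is that a traversal is just a chain of pointer (or key) lookups. A sketch, with a hypothetical adjacency-list graph, shows the access pattern — every hop below is a dictionary lookup, which costs nanoseconds in RAM but milliseconds if each neighbor list had to be paged in from disk:

```python
from collections import deque

# Hypothetical graph held entirely in RAM as an adjacency list.
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave"],
    "dave":  [],
}

def reachable(graph, start):
    """Breadth-first traversal; each hop touches an arbitrary node,
    so there is no locality a disk-based store could exploit."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Because the next node visited is unpredictable, there's no way to prefetch or batch disk reads — which is why the swap-to-disk penalty hits graph queries so hard.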
The rule of thumb is that if you have over a terabyte of highly connected graph data and you need real-time analysis of this graph, you should be looking for an alternative to a shared-nothing architecture. A single CPU with 64 GB of RAM won't be sufficient to hold your graph in RAM. Even if you work hard to load only the necessary data elements into RAM, your links may traverse other nodes that need to be swapped in from disk. This will make your graph queries slow. We'll look into alternatives to this in a case study later in this chapter.
Knowing the hardware options available for big data is an important first step, but distributing software in a cluster matters just as much. Let's take a look at how software can be distributed in a cluster.
6.6 Choosing distribution models: master-slave versus peer-to-peer
From a distribution perspective, there are two main models: master-slave and peer-to-peer. Distribution models determine which nodes are responsible for processing data when a request is made.
Understanding the pros and cons of each distribution model is important when you're looking at a potential big data solution. Peer-to-peer models may be more resilient to failure than master-slave models. Some master-slave distribution models have single points of failure that might impact your system availability, so you might need to take special care when configuring these systems.
Distribution models get to the heart of the question: who's in charge here? There are two ways to answer this question: one node or all nodes. In the master-slave model, one node is in charge (the master). When there's no single node with a special role in taking charge, you have a peer-to-peer distribution model.
Figure 6.7 shows how each of these models works.
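The "who's in charge" distinction can be sketched as request routing. Everything below is a hypothetical illustration, not any particular product's API: in the master-slave cluster, one designated node takes every write (and so is a single point of failure for writes), while in the peer-to-peer cluster any node accepts any request:

```python
import random

class MasterSlaveCluster:
    """One node is in charge: writes must go through the master;
    reads may be served by the master or any slave replica."""
    def __init__(self, nodes):
        self.master, *self.slaves = nodes  # first node is the master

    def route(self, operation):
        if operation == "write":
            return self.master  # single point of failure for writes
        return random.choice([self.master] + self.slaves)

class PeerToPeerCluster:
    """All nodes are in charge: no node has a special role,
    so any node can accept any read or write."""
    def __init__(self, nodes):
        self.nodes = nodes

    def route(self, operation):
        return random.choice(self.nodes)
```

In the master-slave sketch, losing the master halts all writes until a new master is chosen; in the peer-to-peer sketch, losing any one node still leaves every remaining node able to serve both reads and writes — which is the resilience trade-off described above.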