you'll see how each of these architectures works to solve big data problems with different types of data.
Of the architectural data patterns we've discussed so far (row store, key-value store, graph store, document store, and Bigtable store), only two (key-value store and document store) lend themselves to cache-friendliness. Bigtable stores scale well on shared-nothing architectures because their row-column identifiers are similar to key-value stores. But row stores and graph stores aren't cache-friendly, since they don't allow a large BLOB to be referenced by a short key that can be stored in the cache.
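To make the short-key-to-BLOB pattern concrete, here's a minimal sketch of a cache-friendly lookup. The cache class, key scheme, and `load_from_disk` callback are all hypothetical stand-ins for a real distributed cache such as memcached:

```python
import hashlib

class KeyValueCache:
    """A minimal in-memory cache, standing in for memcached or similar."""
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self._store[key] = value

def fetch_document(cache, doc_id, load_from_disk):
    # A short key (here, a hash of the document ID) references a large BLOB,
    # so the cache only needs to index small keys, not scan large values.
    key = hashlib.md5(doc_id.encode()).hexdigest()
    blob = cache.get(key)
    if blob is None:
        blob = load_from_disk(doc_id)  # the expensive path, taken only on a miss
        cache.put(key, blob)
    return blob
```

Because every read and write is addressed by one short key, requests can be spread across a shared-nothing cluster by hashing the key; row stores and graph stores have no such single-key handle for their data.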
For graph traversals to be fast, the entire graph should be in main memory. This is why graph stores work most efficiently when you have enough RAM to hold the graph. If you can't keep your graph in RAM, graph stores will try to swap the data to disk, which will decrease graph query performance by a factor of 1,000. The only way to combat the problem is to move to a shared-memory architecture, where multiple threads all access a large RAM structure without the graph data moving outside of the shared RAM.
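The reason RAM residency matters is that a traversal is just a chain of pointer (or key) lookups. A sketch, with a hypothetical adjacency-list graph, shows the access pattern — every hop below is a dictionary lookup, which costs nanoseconds in RAM but milliseconds if each neighbor list had to be paged in from disk:

```python
from collections import deque

# Hypothetical graph held entirely in RAM as an adjacency list.
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave"],
    "dave":  [],
}

def reachable(graph, start):
    """Breadth-first traversal; each hop touches an arbitrary node,
    so there is no locality a disk-based store could exploit."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Because the next node visited is unpredictable, there's no way to prefetch or batch disk reads — which is why the swap-to-disk penalty hits graph queries so hard.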
The rule of thumb is that if you have over a terabyte of highly connected graph data and you need real-time analysis of this graph, you should be looking for an alternative to a shared-nothing architecture. A single CPU with 64 GB of RAM won't be sufficient to hold your graph in RAM. Even if you work hard to load only the necessary data elements into RAM, your links may traverse other nodes that need to be swapped in from disk. This will make your graph queries slow. We'll look into alternatives to this in a case study later in this chapter.
Knowing the hardware options available for big data is an important first step, but distributing software in a cluster matters just as much. Let's take a look at how software can be distributed in a cluster.
6.6 Choosing distribution models: master-slave versus peer-to-peer
From a distribution perspective, there are two main models: master-slave and peer-to-peer. Distribution models determine which nodes are responsible for processing data when a request is made.
Understanding the pros and cons of each distribution model is important when you're looking at a potential big data solution. Peer-to-peer models may be more resilient to failure than master-slave models. Some master-slave distribution models have single points of failure that might impact your system availability, so you might need to take special care when configuring these systems.
Distribution models get to the heart of the question: who's in charge here? There are two ways to answer this question: one node or all nodes. In the master-slave model, one node is in charge (the master). When there's no single node with a special role in taking charge, you have a peer-to-peer distribution model.
Figure 6.7 shows how each of these models works.
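The "who's in charge" distinction can be sketched as request routing. Everything below is a hypothetical illustration, not any particular product's API: in the master-slave cluster, one designated node takes every write (and so is a single point of failure for writes), while in the peer-to-peer cluster any node accepts any request:

```python
import random

class MasterSlaveCluster:
    """One node is in charge: writes must go through the master;
    reads may be served by the master or any slave replica."""
    def __init__(self, nodes):
        self.master, *self.slaves = nodes  # first node is the master

    def route(self, operation):
        if operation == "write":
            return self.master  # single point of failure for writes
        return random.choice([self.master] + self.slaves)

class PeerToPeerCluster:
    """All nodes are in charge: no node has a special role,
    so any node can accept any read or write."""
    def __init__(self, nodes):
        self.nodes = nodes

    def route(self, operation):
        return random.choice(self.nodes)
```

In the master-slave sketch, losing the master halts all writes until a new master is chosen; in the peer-to-peer sketch, losing any one node still leaves every remaining node able to serve both reads and writes — which is the resilience trade-off described above.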