node held a portion of the total data, a technique referred to as sharding: breaking the
database up into shards. Queries are broken into sub-queries, which are then applied
to specific nodes in the server cluster. The results of these sub-queries are then
aggregated to produce the final answer, so all resources are exploited in parallel. To improve
performance or cater to larger data volumes, more nodes are added to the cluster as and
when needed.
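As a rough illustration, here is a minimal sketch in Python of such a scatter-gather query. It assumes an in-memory list of dictionaries standing in for the shards held by separate nodes; the names (shards, sub_query, scatter_gather) are purely illustrative and not tied to any particular product.

from concurrent.futures import ThreadPoolExecutor

# Illustrative in-memory "shards": each entry stands in for the slice of
# data held by one node; in a real cluster these would be separate servers.
shards = [
    {"alice": 120, "dave": 85},
    {"bob": 42, "erin": 310},
    {"carol": 7, "frank": 58},
]

def sub_query(shard, predicate):
    """Run the sub-query against a single shard (one node)."""
    return [(key, value) for key, value in shard.items() if predicate(key, value)]

def scatter_gather(predicate):
    """Send the sub-query to every node in parallel, then aggregate the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: sub_query(s, predicate), shards))
    return [row for partial in partials for row in partial]

# Example: find every record with a value greater than 50 across the cluster.
print(scatter_gather(lambda key, value: value > 50))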
Most NoSQL databases have a scale-out architecture and can be distributed across
many server nodes. How they handle data distribution, data compression, and node
failure varies from product to product, but the general architecture is similar. They are
usually built in a shared-nothing manner so that no node has to know much about what's
happening on other nodes.
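To make the shared-nothing idea concrete, the following sketch shows one simple way a key could be mapped to the node that owns it; the owner_of function and the hard-coded membership list are hypothetical. Real products typically use more elaborate schemes, such as consistent hashing, so that adding a node moves only a fraction of the keys.

import hashlib

# Hypothetical cluster membership; scaling out means appending to this list.
nodes = ["node-0", "node-1", "node-2"]

def owner_of(key):
    """Map a key to the node that stores it, using a stable hash.

    Each node only needs the membership list, not any knowledge of what
    the other nodes hold -- the essence of a shared-nothing design.
    """
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

for key in ("user:42", "user:43", "order:7"):
    print(key, "->", owner_of(key))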
The scale-out architecture brings to light two interesting features, and both of these
features focus on the ability to distribute data over a cluster of servers.
Replication: This is all about taking the same data and copying it across multiple
nodes. There are two replication strategies: Master-Slave and Peer-to-Peer.
Master-Slave
In the Master-Slave approach, you replicate data across multiple nodes. One node acts
as the designated master and the rest are slave nodes that keep copies of the entire data
set, thereby providing resilience to node failures. The master node is the most up-to-date
and authoritative source for the data set and is responsible for managing consistency.
Periodically, the slaves synchronize their content with the master.
Master-Slave replication is most helpful for scaling when you have a read-intensive
data set. You can scale horizontally to handle more read requests by adding more slave
nodes and ensuring that all read requests are routed to the slaves. However, this approach
hits a major bottleneck when the workload is both read- and write-intensive: the master
has to juggle incoming updates and pass them on to the slave nodes to keep the data
consistent everywhere!
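A toy sketch of this routing logic, assuming Python and plain dicts standing in for database connections, is shown below; the MasterSlaveRouter class is invented for illustration and does not reflect any specific database's API.

import random

class MasterSlaveRouter:
    """Toy read/write router: writes go to the master, reads to the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def write(self, key, value):
        # Only the master accepts writes; it is the authoritative copy.
        self.master[key] = value

    def replicate(self):
        # Periodic synchronization: each slave pulls the master's state.
        for slave in self.slaves:
            slave.update(self.master)

    def read(self, key):
        # Reads scale horizontally: any slave can answer.
        return random.choice(self.slaves).get(key)

router = MasterSlaveRouter(master={}, slaves=[{}, {}])
router.write("user:1", "Alice")
router.replicate()
print(router.read("user:1"))  # 'Alice' once replication has run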
Peer-To-Peer
While the Master-Slave approach provides read scalability, it severely lacks write
scalability. The Peer-to-Peer replication approach addresses this issue by doing away with
the master node altogether. All replica nodes have equal weight: they all accept write
requests, and the loss of any node doesn't prevent access to the data store, because the
remaining nodes are accessible and hold copies of the same data, although it may not be
the most up-to-date data.
In this approach, the main concern is data consistency across all the nodes:
when you perform write operations on the same data set through two different nodes, you
run the risk of two different users attempting to update the same record at the same time,
thus introducing a write-write conflict. Write-write conflicts of this sort are managed
through a concept called “serialization,” wherein you apply the write operations
one after another. Serialization can be applied in either a pessimistic or an optimistic mode.
The pessimistic mode prevents conflicts from occurring: all write operations
are performed sequentially, and only when they are all done is the data set
made available. The optimistic mode lets conflicts occur, but detects them
and later takes corrective action to sort them out, making the write operations
eventually consistent.
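The following sketch illustrates the optimistic mode with a simple version check (a compare-and-set), where a write that arrives with a stale version is reported as a conflict. The VersionedStore class is hypothetical and only hints at what real products do with version numbers or vector clocks.

class VersionedStore:
    """Toy optimistic serialization: every record carries a version number.

    A write succeeds only if the writer saw the current version; otherwise
    a write-write conflict is reported and the corrective action (retry,
    merge, last-write-wins) is left to the caller.
    """

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def write(self, key, expected_version, value):
        current_version, _ = self.data.get(key, (0, None))
        if current_version != expected_version:
            return False  # conflict detected: someone else wrote in between
        self.data[key] = (current_version + 1, value)
        return True

store = VersionedStore()
version, _ = store.read("cart:9")
print(store.write("cart:9", version, ["book"]))  # True: first writer wins
print(store.write("cart:9", version, ["pen"]))   # False: stale version, conflict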