responsible for shipping the updates and data definition language operations to the
secondary replicas.
Since some partitions may experience higher load than others, the simple technique of balancing the number of primary and secondary partitions per node might not balance the loads. The system can rebalance dynamically by using the failover mechanism to tell a secondary on a lightly loaded server to become the primary, either demoting the former primary to a secondary or moving it to another server. A keyed table group can be partitioned dynamically: if a partition exceeds the maximum allowable partition size (in bytes or in the amount of operational load it receives), it is split into two partitions. In general, the size of each hosted SQL Azure database cannot exceed the limit of 50 GB.
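The size-triggered split described above can be sketched in a few lines. This is an illustrative model only, not the SQL Azure implementation; the names (Partition, maybe_split, the size accounting) are all hypothetical, and the byte threshold is a stand-in for the real limit.

```python
class Partition:
    """Toy key-range partition [low, high) holding an in-memory row map."""

    def __init__(self, low, high, rows=None):
        self.low, self.high = low, high
        self.rows = rows if rows is not None else {}

    def size_bytes(self):
        # Rough stand-in for the system's real storage accounting.
        return sum(len(str(k)) + len(str(v)) for k, v in self.rows.items())

    def split(self):
        """Split this partition at the median key into two halves."""
        keys = sorted(self.rows)
        mid_key = keys[len(keys) // 2]
        left = Partition(self.low, mid_key,
                         {k: v for k, v in self.rows.items() if k < mid_key})
        right = Partition(mid_key, self.high,
                          {k: v for k, v in self.rows.items() if k >= mid_key})
        return left, right


def maybe_split(partition, max_bytes):
    """Return one partition unchanged, or two if it exceeds the size cap."""
    if partition.size_bytes() > max_bytes:
        return partition.split()
    return (partition,)
```

In a real system the split would also trigger placement of the two new partitions (and their secondaries) across servers, feeding back into the load-balancing mechanism described above.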
9.5 WEB SCALE DATA MANAGEMENT: TRADEOFFS
An important issue in designing large-scale data management applications is to avoid the mistake of trying to be "everything for everyone." As with many types of computer systems, no one system can be best for all workloads, and different systems make different tradeoffs to optimize for different applications. Therefore, one of the most challenging aspects of these applications is to identify the most important features of the target application domain and to decide on the various design tradeoffs, which immediately lead to performance tradeoffs. To tackle this problem, Jim Gray came up with the heuristic rule of "20 queries" [38]. The main idea of this heuristic is that on each project, we need to identify the 20 most important questions the users want the data system to answer. He argued that five questions are not enough to reveal a broader pattern, while a hundred questions would result in a lack of focus.
In general, it is hard to maintain ACID guarantees in the face of data replication
over large geographic distances. The CAP theorem [15,34] shows that a shared-data system can choose at most two of the following three properties: Consistency (all
records are the same in all replicas), Availability (a replica failure does not prevent
the system from continuing to operate), and tolerance to Partitions (the system
still functions when distributed replicas cannot talk to each other). When data
is replicated over a wide area, this essentially leaves just consistency and avail-
ability for a system to choose between. Thus, the C (consistency) part of ACID is
typically compromised to yield reasonable system availability [2]. Therefore, most cloud data management systems overcome the difficulties of distributed replication by relaxing the ACID guarantees. In particular, they implement various forms of weaker consistency models (e.g., eventual consistency, timeline consistency, session consistency [60]), so that all replicas do not have to agree on the same value of a data item at every moment in time. Hence, NoSQL systems can be classified, based on which properties of the CAP theorem they support, into three categories:
CA systems : Consistent and highly available, but not partition-tolerant
CP systems : Consistent and partition-tolerant, but not highly available
AP systems : Highly available and partition-tolerant, but not consistent
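The eventual-consistency relaxation mentioned above can be illustrated with a toy store in which writes go to a primary replica and are shipped to secondaries asynchronously: until propagation runs, a read against a secondary may return a stale (or missing) value. All class and method names here are illustrative, not any particular system's API.

```python
class Replica:
    """A single replica holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Primary/secondary store with asynchronous (deferred) replication."""

    def __init__(self, n_secondaries=2):
        self.primary = Replica()
        self.secondaries = [Replica() for _ in range(n_secondaries)]
        self.pending = []  # replication log not yet shipped to secondaries

    def write(self, key, value):
        # The write is acknowledged after updating only the primary;
        # replication happens later, which is what weakens consistency.
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index=0):
        # A client reading a secondary may observe a stale value.
        return self.secondaries[replica_index].data.get(key)

    def propagate(self):
        # Anti-entropy step: apply the pending log so all replicas
        # eventually converge to the same values.
        for key, value in self.pending:
            for replica in self.secondaries:
                replica.data[key] = value
        self.pending.clear()
```

In CAP terms this sketch behaves like an AP system: reads and writes succeed even while secondaries lag behind, at the cost of replicas temporarily disagreeing until propagate() runs.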