throughput. The disadvantage is the greater risk of data loss if a server crashes and loses unsynched updates.
Synchronous vs. asynchronous replication: Synchronous replication ensures that all copies are up-to-date, but it potentially incurs high latency on updates. Furthermore, availability may be impacted if synchronously replicated updates cannot complete while some replicas are offline. Asynchronous replication avoids high write latency but allows replicas to be stale. Furthermore, updates may be lost if a failure occurs before they can be replicated (see the replication sketch after this list).
Data partitioning: Systems may be strictly row-based or allow for column storage. Row-based storage supports efficient access to an entire record and is ideal if we typically access a few records in their entirety. Column-based storage is more efficient for accessing a subset of the columns, particularly when many records are accessed (see the layout sketch after this list).
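To make the replication tradeoff concrete, the following is a minimal, hypothetical Python sketch; the class and method names are invented for illustration and do not correspond to any particular system. In synchronous mode the primary waits for every replica to apply a write before acknowledging it, so copies stay up-to-date at the price of latency and of blocking when a replica is unreachable; in asynchronous mode the write is acknowledged immediately and replicas catch up in the background, so they may briefly serve stale data and an unreplicated update can be lost.

```python
import queue
import threading
import time


class Replica:
    """A single copy of the data (hypothetical in-memory store)."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    """Primary node that forwards writes to its replicas."""

    def __init__(self, replicas, synchronous=True):
        self.local = Replica()
        self.replicas = replicas
        self.synchronous = synchronous
        self._log = queue.Queue()
        if not synchronous:
            # Background thread drains the log and updates replicas lazily.
            threading.Thread(target=self._drain, daemon=True).start()

    def write(self, key, value):
        self.local.apply(key, value)
        if self.synchronous:
            # Wait for every replica before acknowledging: copies stay
            # current, but latency tracks the slowest replica and the write
            # blocks if a replica is offline.
            for r in self.replicas:
                r.apply(key, value)
        else:
            # Acknowledge immediately; replicas catch up later and may be
            # stale (or lose this update if the primary crashes first).
            self._log.put((key, value))

    def _drain(self):
        while True:
            key, value = self._log.get()
            time.sleep(0.05)          # simulated replication delay
            for r in self.replicas:
                r.apply(key, value)


if __name__ == "__main__":
    replicas = [Replica(), Replica()]
    async_primary = Primary(replicas, synchronous=False)
    async_primary.write("x", 1)
    print(replicas[0].data.get("x"))  # likely None: replica still stale
    time.sleep(0.1)
    print(replicas[0].data.get("x"))  # 1 after the background copy runs
```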
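The row versus column layout tradeoff can likewise be illustrated with a small, hypothetical sketch: the same three-attribute table is stored once as a list of records and once as one array per attribute. Fetching a whole record touches a single entry in the row layout, whereas aggregating one attribute over many records touches only one array in the column layout.

```python
# Hypothetical table stored in two layouts.
rows = [
    {"id": 1, "name": "alice", "balance": 120.0},
    {"id": 2, "name": "bob",   "balance": 80.5},
    {"id": 3, "name": "carol", "balance": 42.0},
]

# Column layout: one array per attribute, aligned by position.
columns = {
    "id":      [r["id"] for r in rows],
    "name":    [r["name"] for r in rows],
    "balance": [r["balance"] for r in rows],
}

# Row layout favors reading one record in its entirety.
record = rows[1]                      # one lookup returns every attribute
print(record)

# Column layout favors scanning one attribute across many records.
total = sum(columns["balance"])       # touches only the "balance" array
print(total)
```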
Florescu and Kossmann [32] argued that in a cloud environment, the main
metric that needs to be optimized is the cost as measured in dollars. Therefore,
the big challenge of data management applications is no longer how fast a
database workload can be executed or whether a particular throughput can be
achieved; instead, the challenge is how many machines are necessary to meet
the performance requirements of a particular workload. This argument fits well
with a rule-of-thumb calculation that has been proposed by Jim Gray regarding
the opportunity costs of distributed computing on the Internet as opposed to local
computations [35]. Gray reasons that, except for highly processing-intensive applications, outsourcing computing tasks to a distributed environment does not pay off, because network traffic fees outweigh the savings in processing power. In principle, calculating the tradeoff between basic computing services can be useful for getting a general idea of the economies involved. This method can easily be applied to the pricing schemes of cloud computing providers (e.g., Amazon, Google).
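As an illustration of this kind of back-of-the-envelope reasoning, the short calculation below compares the network fee for shipping a dataset to a remote provider against the compute cost saved by running the job there. All prices and workload figures are invented placeholders, not Gray's original numbers or any provider's actual rates; only the structure of the tradeoff matters.

```python
# Hypothetical prices; substitute a provider's real rate card.
NETWORK_FEE_PER_GB = 0.10      # $ per GB transferred to the provider
CPU_PRICE_PER_HOUR = 0.05      # $ per CPU-hour rented remotely


def outsourcing_pays_off(data_gb, cpu_hours, local_cpu_cost_per_hour):
    """Return True if shipping the data out is cheaper than computing locally."""
    transfer_cost = data_gb * NETWORK_FEE_PER_GB
    remote_compute_cost = cpu_hours * CPU_PRICE_PER_HOUR
    local_compute_cost = cpu_hours * local_cpu_cost_per_hour
    return transfer_cost + remote_compute_cost < local_compute_cost


# Data-heavy, compute-light job: moving the data dominates, so it stays local.
print(outsourcing_pays_off(data_gb=1000, cpu_hours=10,
                           local_cpu_cost_per_hour=0.08))   # False

# Compute-heavy job on little data: outsourcing pays off.
print(outsourcing_pays_off(data_gb=1, cpu_hours=5000,
                           local_cpu_cost_per_hour=0.08))   # True
```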
Florescu and Kossmann [32] have also argued that, in the new large-scale web applications, the requirement to provide 100% read and write availability for all users has overshadowed the importance of the ACID paradigm as the gold standard for data consistency. In these applications, no user is ever allowed to be blocked. Hence, consistency has turned into an optimization goal in modern data management systems: the aim is to minimize the cost of resolving inconsistencies, rather than treating consistency as a constraint, as in traditional database systems. Therefore, it is better to design a system that deals with resolving inconsistencies than one that prevents inconsistencies under all circumstances.
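One common way to "resolve rather than prevent" inconsistencies is to accept writes on any replica and reconcile divergent versions afterwards. The sketch below uses last-writer-wins reconciliation based on write timestamps purely as an illustration; it is not a mechanism proposed in [32], and real systems often use richer schemes (vector clocks, application-level merge functions).

```python
from dataclasses import dataclass


@dataclass
class Version:
    """A replica's copy of one item, tagged with the time it was written."""
    value: str
    timestamp: float    # wall-clock write time (a simplifying assumption)


def resolve(versions):
    """Last-writer-wins: keep the version with the newest timestamp."""
    return max(versions, key=lambda v: v.timestamp)


# Two replicas accepted conflicting writes while partitioned.
replica_a = Version(value="shipped", timestamp=1_700_000_100.0)
replica_b = Version(value="cancelled", timestamp=1_700_000_160.0)

# Instead of blocking either writer, the system reconciles later
# (e.g., on read or during anti-entropy).
winner = resolve([replica_a, replica_b])
print(winner.value)   # "cancelled": the later write wins
```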
Kossmann et al. [41] conducted an end-to-end experimental evaluation of the performance and cost of running enterprise web applications with OLTP workloads on alternative cloud services (e.g., RDS, SimpleDB, S3, Google AppEngine, Azure). The results of the experiments showed that the alternative services varied greatly in both cost and performance. Most services had significant scalability issues. They confirmed the observation that public clouds lack support for uploading large data volumes: it was difficult for them to upload 1 TB or more of raw data through the providers' APIs. With regard to cost, they concluded that Google