Table 3.1 Design decisions of various web scale data management systems

System          Data model         Query interface   Consistency   CAP options   License
Dynamo          Key-value          API               Eventual      AP            Inter@AMZN
PNUTS           Key-value          API               Timeline      AP            Inter@YHOO
Bigtable        Column families    API               Strict        CP            Inter@GOOG
Cassandra       Column families    API               Tunable       AP            Apache
HBase           Column families    API               Strict        CP            Apache
Hypertable      Multi-dim. table   API/HQL           Eventual      AP            GNU
CouchDB         Document           API               Eventual      AP            Apache
SimpleDB        Key-value          API               Multiple      AP            Commercial
S3              Large object       API               Eventual      AP            Commercial
Table storage   Key-value          API/LINQ          Strict        AP/CP         Commercial
Blob storage    Large object       API               Strict        AP/CP         Commercial
Datastore       Column families    API/GQL           Strict        CP            Commercial
RDS             Relational         SQL               Strict        CA            Commercial
Azure SQL       Relational         SQL               Strict        CA            Commercial
Cloud SQL       Relational         SQL               Strict        CA            Commercial
• Latency versus durability: Writes may be synched to disk before the system returns success to the user, or they may be stored in memory at write time and synched later. The advantage of the latter approach is that avoiding disk access greatly improves write latency and potentially improves throughput. The disadvantage is the greater risk of data loss if a server crashes and loses unsynched updates (the first sketch after this list illustrates the two write paths).
• Synchronous versus asynchronous replication: Synchronous replication ensures all copies are up to date but potentially incurs high latency on updates. Furthermore, availability may be impacted if synchronously replicated updates cannot complete while some replicas are offline. Asynchronous replication avoids high write latency but allows replicas to be stale. Furthermore, data loss may occur if an update is lost due to failure before it can be replicated (see the second sketch below).
• Data partitioning: Systems may be strictly row-based or allow for column storage. Row-based storage supports efficient access to an entire record and is ideal if we typically access a few records in their entirety. Column-based storage is more efficient for accessing a subset of the columns, particularly when multiple records are accessed (see the third sketch below).
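The latency/durability tradeoff is easy to see in code. The following minimal Python sketch (the WriteAheadLog class and the file paths are invented for illustration, not taken from any system in Table 3.1) contrasts a write path that synchs every update to disk with one that leaves updates in the operating system's cache:

```python
import os

class WriteAheadLog:
    """Append-only log illustrating the latency/durability tradeoff.

    durable=True  -> fsync on every write: slower, but a crash loses nothing.
    durable=False -> writes sit in the OS page cache: faster, but a crash
                     can lose any updates not yet synched to disk.
    """

    def __init__(self, path, durable):
        self.f = open(path, "ab")
        self.durable = durable

    def write(self, record: bytes):
        self.f.write(record + b"\n")
        if self.durable:
            self.f.flush()              # push Python's buffer to the OS
            os.fsync(self.f.fileno())   # force the OS to write to disk

fast_log = WriteAheadLog("/tmp/fast.log", durable=False)  # low latency, risk of loss
safe_log = WriteAheadLog("/tmp/safe.log", durable=True)   # high latency, no loss
fast_log.write(b"update-1")
safe_log.write(b"update-1")
```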
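The second sketch contrasts the two replication modes with a toy in-memory primary/replica pair (the ReplicatedStore class is an assumption for illustration, not the design of any system in Table 3.1):

```python
import threading, queue

class ReplicatedStore:
    """Primary with one replica; the flag selects the replication mode."""

    def __init__(self, synchronous):
        self.primary, self.replica = {}, {}
        self.synchronous = synchronous
        self.backlog = queue.Queue()
        threading.Thread(target=self._apply_async, daemon=True).start()

    def put(self, key, value):
        self.primary[key] = value
        if self.synchronous:
            # Synchronous: do not acknowledge until the replica has the
            # update. Higher latency; unavailable if the replica is down.
            self.replica[key] = value
        else:
            # Asynchronous: acknowledge immediately, replicate later.
            # Lower latency, but the replica may be stale, and the update
            # is lost if the primary fails before draining the backlog.
            self.backlog.put((key, value))

    def _apply_async(self):
        while True:
            key, value = self.backlog.get()
            self.replica[key] = value

sync_store = ReplicatedStore(synchronous=True)    # consistent, slower writes
async_store = ReplicatedStore(synchronous=False)  # fast writes, stale replica
sync_store.put("k", "v")
async_store.put("k", "v")
```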
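The third sketch shows why the access pattern decides between the two layouts, using invented user records:

```python
# Row-based layout: one tuple per record. Reading a whole record is one
# lookup; scanning just the "age" column still touches every field.
rows = [(1, "ann", 34, "oslo"), (2, "bob", 29, "lima"), (3, "cai", 41, "pune")]
record = rows[1]                       # efficient: entire record at once
ages_from_rows = [r[2] for r in rows]  # wasteful: drags name/city along

# Column-based layout: one array per column. Scanning a single column
# reads only that array, which is why analytical scans over many records
# favor column storage.
columns = {
    "id":   [1, 2, 3],
    "name": ["ann", "bob", "cai"],
    "age":  [34, 29, 41],
    "city": ["oslo", "lima", "pune"],
}
ages_from_columns = columns["age"]     # efficient: only the needed column
```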
Florescu and Kossmann [133] argued that in a cloud environment, the main metric to optimize is cost as measured in dollars. The big challenge for data management applications is therefore no longer how fast a database workload can be executed or whether a particular throughput can be achieved; instead, it is how many machines are necessary to meet the performance requirements of a particular workload. This argument fits well with a rule-of-thumb calculation proposed by Jim Gray regarding the opportunity costs of distributed computing on the Internet as opposed to local computation [139]. Gray reasons that, except for highly processing-intensive applications, outsourcing computation over the Internet does not pay off: network bandwidth is expensive relative to computation and storage, so it is usually cheaper to put the computation near the data than to move the data to the computation.
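Under this cost-centric view, capacity planning reduces to a back-of-the-envelope calculation in the spirit of the following sketch; all numbers are invented purely for illustration:

```python
import math

required_throughput = 120_000   # requests/second the workload must sustain
per_machine_throughput = 9_000  # requests/second one machine can serve
price_per_machine_hour = 0.45   # dollars

# The question is not "how fast?" but "how many machines, at what cost?"
machines = math.ceil(required_throughput / per_machine_throughput)  # 14
monthly_cost = machines * price_per_machine_hour * 24 * 30
print(f"{machines} machines, ~${monthly_cost:,.0f}/month")
```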
 