More specifically, the data of users who are socially connected should be stored
on servers within short reach of each other. An ideal storage scheme should be
socially aware.
However, the most prominent distributed storage scheme for OSNs, Cassandra
[ 21 ], is not socially aware. Originally deployed for Facebook to enhance its Inbox
Search feature and now an Apache project, Cassandra has been used by most
popular OSNs including Facebook, Twitter, Digg, and Reddit. While there exist
well-known distributed file and relational database systems such as Ficus [ 33 ], Coda
[ 36 ], GFS [ 14 ], Farsite [ 1 ], and Bayou [ 37 ], these systems do not scale to the high
read/write rates that OSNs demand. Cassandra is designed to
run on top of an infrastructure of many commodity storage hosts, possibly spread
across different data centers, with high write throughput without sacrificing read
efficiency. Cassandra is a key-value store that resembles the BigTable
data model [ 8 ] running on a Dynamo-like infrastructure [ 11 ]. The data
partitioning scheme underlying both Cassandra and Dynamo is based on consistent
hashing [ 17 ] over an order-preserving DHT. Because keys are hashed to essentially
random servers, social locality is broken.
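For illustration, a minimal consistent-hash ring in the spirit of Dynamo and Cassandra can be sketched as follows. This is a toy sketch, not the actual implementation of either system: the class name, the use of MD5, and the fixed virtual-node count are choices made here for brevity, and real deployments add replication, membership changes, and load balancing.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a key to a point on the ring (MD5 is one common choice).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring: keys land on the first server
    point found clockwise from the key's hash point."""

    def __init__(self, servers, vnodes=8):
        # Each server gets several virtual nodes to even out load.
        self.ring = sorted(
            (_hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        # First virtual node clockwise from the key's hash point,
        # wrapping around the ring at the end.
        i = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[i][1]
```

Note that `lookup("alice")` and `lookup("bob")` are assigned independently of any friendship between the two users, which is exactly the socially oblivious behavior described above.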
Socially aware data partitioning and replication schemes have been proposed to
improve system performance and scalability by enforcing social locality in data
storage. SPAR [ 32 ] is one such scheme; it preserves social locality perfectly by
requiring every pair of neighboring users to have their data colocated on the same
servers. Since this is impossible when each user keeps only a single copy of his
or her data, replicas are introduced and placed on the servers so that the number
of replicas needed to ensure perfect social locality is minimized. Another
scheme is SCHISM [ 9 ], which can partition and replicate user data of a social
graph efficiently by taking into account transaction workload such as how often two
users are involved in the same transaction. In this book, we introduce two socially
aware techniques for data partitioning and replication, S-PUT and S-CLONE, which
our research group has recently devised. S-PUT is specifically designed for data
partitioning, whereas S-CLONE performs data replication on top of an existing
partition of the data across servers; this underlying partition can be arbitrary,
e.g., the result of running Cassandra or S-PUT. Unlike SPAR and SCHISM,
S-CLONE attempts to maximize social locality under a fixed space budget for
replication. S-PUT and S-CLONE can be deployed separately or work together in a
distributed storage system.
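To make the objective concrete, the social locality achieved by a given placement can be measured as the fraction of friendship edges whose two endpoints have data on at least one common server. The sketch below is illustrative only and is not drawn from the S-PUT or S-CLONE formulations; the `placement` mapping (user to the set of servers holding any copy of that user's data) is an assumption of this example.

```python
def social_locality(edges, placement):
    """Fraction of friendship edges (u, v) whose endpoints have
    data (primary copy or replica) on at least one common server.
    `edges` is a list of user pairs; `placement` maps each user to
    the set of servers holding a copy of that user's data."""
    if not edges:
        return 1.0
    colocated = sum(1 for u, v in edges if placement[u] & placement[v])
    return colocated / len(edges)
```

Under this metric, SPAR insists on a value of 1.0 and minimizes the number of replicas, while S-CLONE fixes the replication budget and pushes the value as high as it can.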
While later chapters of this book focus on the issue of social locality, the
formulation of socially aware data partitioning and replication as optimization
problems, and the details of S-PUT and S-CLONE, this chapter is devoted to a
brief review of Dynamo, BigTable, and Cassandra, three key techniques that form
the data storage infrastructure of most OSNs today. The review summarizes these
techniques as described in their source publications.