More specifically, the data of users who are socially connected should be stored
on servers within short reach of each other. An ideal storage scheme should be
socially aware.
However, the most prominent distributed storage scheme for OSNs, Cassandra
[ 21 ], is not socially aware. Originally deployed for Facebook to enhance its Inbox
Search feature and now an Apache project, Cassandra has been used by most
popular OSNs including Facebook, Twitter, Digg, and Reddit. While there exist
well-known distributed file and relational database systems such as Ficus [ 33 ], Coda
[ 36 ], GFS [ 14 ], Farsite [ 1 ], and Bayou [ 37 ], these systems do not scale to the high
read/write rates that OSNs demand. Cassandra is designed to
run on top of an infrastructure of many commodity storage hosts, possibly spread
across different data centers, with high write throughput without sacrificing read
efficiency. Cassandra is a key-value store that resembles the BigTable
data model [ 8 ] running on a Dynamo-like infrastructure [ 11 ]. The data
partitioning scheme underlying both Cassandra and Dynamo is based on consistent
hashing [ 17 ] over an order-preserving DHT. Because keys are hashed to essentially
random servers, social locality is broken.
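For illustration, a minimal consistent-hash ring in the spirit of Dynamo and Cassandra can be sketched as follows. This is a toy sketch, not the actual implementation of either system: the class name, the use of MD5, and the fixed virtual-node count are choices made here for brevity, and real deployments add replication, membership changes, and load balancing.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a key to a point on the ring (MD5 is one common choice).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring: keys land on the first server
    point found clockwise from the key's hash point."""

    def __init__(self, servers, vnodes=8):
        # Each server gets several virtual nodes to even out load.
        self.ring = sorted(
            (_hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        # First virtual node clockwise from the key's hash point,
        # wrapping around the ring at the end.
        i = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[i][1]
```

Note that `lookup("alice")` and `lookup("bob")` are assigned independently of any friendship between the two users, which is exactly the socially oblivious behavior described above.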
Socially aware data partitioning and replication schemes have been proposed to
improve system performance and scalability by enforcing social locality in data
storage. SPAR [ 32 ] is one such scheme; it preserves social locality perfectly by
requiring every pair of neighboring users to have their data colocated on the same
servers. Since this is impossible when each user keeps only a single copy of his
or her data, replicas are introduced and placed on the servers so that the number
of replicas needed to ensure perfect social locality is minimized. Another
scheme is SCHISM [ 9 ], which can partition and replicate user data of a social
graph efficiently by taking into account transaction workload such as how often two
users are involved in the same transaction. In this book, we introduce two socially
aware techniques for data partitioning and replication, S-PUT and S-CLONE, which
our research group has recently devised. S-PUT is specifically designed for data
partitioning, whereas S-CLONE performs data replication on top of an existing
partition of the data across servers; this underlying partition can be arbitrary,
e.g., the result of running Cassandra or S-PUT. Unlike SPAR and SCHISM,
S-CLONE attempts to maximize social locality under a fixed space budget for
replication. S-PUT and S-CLONE can be deployed separately or work together in a
distributed storage system.
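To make the objective concrete, the social locality achieved by a given placement can be measured as the fraction of friendship edges whose two endpoints have data on at least one common server. The sketch below is illustrative only and is not drawn from the S-PUT or S-CLONE formulations; the `placement` mapping (user to the set of servers holding any copy of that user's data) is an assumption of this example.

```python
def social_locality(edges, placement):
    """Fraction of friendship edges (u, v) whose endpoints have
    data (primary copy or replica) on at least one common server.
    `edges` is a list of user pairs; `placement` maps each user to
    the set of servers holding a copy of that user's data."""
    if not edges:
        return 1.0
    colocated = sum(1 for u, v in edges if placement[u] & placement[v])
    return colocated / len(edges)
```

Under this metric, SPAR insists on a value of 1.0 and minimizes the number of replicas, while S-CLONE fixes the replication budget and pushes the value as high as it can.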
While later chapters of this book focus on the issue of social locality, the
formulation of socially aware data partitioning and replication as optimization
problems, and the details of S-PUT and S-CLONE, this chapter is devoted to a
brief review of Dynamo, BigTable, and Cassandra, three key techniques that form
the data storage infrastructure of most OSNs today. The review summarizes these
techniques as described in their source publications.