Social Locality in Data Storage - Data Storage for Social Networks: A Socially Aware Approach

Chapter 2

Social Locality in Data Storage

The locality property in data storage can be interpreted in different ways. In
Cassandra, a column family is a group of columns that are frequently accessed
together, e.g., name, address, phone number, and email address. These columns
share the same row key and are therefore stored on the same machine. Data
locality of this kind is content-based. By social locality, we refer to data
that are accessed by users who share some social relationship. Although these
data may be unrelated content-wise, they are frequently queried together in an
online social network and should therefore be stored in close proximity on
disk. Another way to look at locality is in terms of geography: it may be
desirable to store on the same server the data of users who reside in the same
geographic region (e.g., think Akamai).

Although content- and geography-based locality have been taken into account in
the data storage literature, social locality is an emerging concept. Social
locality is not achieved by Cassandra, which uses random-based consistent
hashing to assign data to servers. To illustrate the benefit of social
locality, consider the 10-node social graph in Fig. 2.1a, stored in a
distributed manner across three servers, A, B, and C. Suppose that Cassandra
partitioning yields the partition shown in Fig. 2.1b. We allow one replica for
each user in addition to the primary copy. Figure 2.1c shows the result of
applying a random replication algorithm, such as the Rack Unaware strategy of
Cassandra, on top of this partition; this algorithm essentially places the
replicas at random among the three servers. In contrast, Fig. 2.1d shows the
result of running an (imaginary) replication algorithm that preserves social
locality by placing the data of neighboring nodes on the same server as much
as possible. Table 2.1 summarizes the cost of reading the data for each of the
ten users, showing a noticeable improvement of socially aware replication over
random replication (24% better). This example, despite its simplicity, supports
the importance of social locality in distributed data storage.
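
The comparison between random and socially aware replication can be sketched in
a few lines of Python. Everything in this sketch is an assumption made for
illustration: the six-user graph is a toy example (not the graph of Fig. 2.1),
the modulo hash is only a stand-in for Cassandra's consistent hashing, the
greedy `social_replica` heuristic is not the chapter's algorithm, and the cost
model (one unit for every friend whose data has no copy on the reader's primary
server) is a guess at the kind of read cost that Table 2.1 reports.

```python
import hashlib
import random

# Hypothetical friendship graph (symmetric); NOT the graph of Fig. 2.1.
FRIENDS = {
    "u1": {"u2", "u3"},
    "u2": {"u1", "u3"},
    "u3": {"u1", "u2", "u4"},
    "u4": {"u3", "u5", "u6"},
    "u5": {"u4", "u6"},
    "u6": {"u4", "u5"},
}
SERVERS = ["A", "B", "C"]


def hash_primary(user):
    """Pick the primary server by hashing the user id.

    A simple modulo hash, used here as a stand-in for consistent hashing:
    placement ignores the social graph entirely.
    """
    digest = hashlib.sha256(user.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]


def random_replica(user, primary, rng):
    """Place the single replica on a random server other than the primary."""
    return rng.choice([s for s in SERVERS if s != primary[user]])


def social_replica(user, primary):
    """Greedy heuristic (assumed): place the replica on the non-primary
    server that hosts the primary copies of the most friends."""
    candidates = [s for s in SERVERS if s != primary[user]]
    return max(candidates,
               key=lambda s: sum(primary[f] == s for f in FRIENDS[user]))


def total_read_cost(primary, replica):
    """Assumed cost model: a read is served at the reader's primary server
    and costs one unit per friend with no copy on that server."""
    cost = 0
    for user, friends in FRIENDS.items():
        home = primary[user]
        cost += sum(home not in (primary[f], replica[f]) for f in friends)
    return cost


primary = {u: hash_primary(u) for u in FRIENDS}
rng = random.Random(0)  # fixed seed so the random placement is repeatable
random_rep = {u: random_replica(u, primary, rng) for u in FRIENDS}
social_rep = {u: social_replica(u, primary) for u in FRIENDS}

random_cost = total_read_cost(primary, random_rep)
social_cost = total_read_cost(primary, social_rep)
print("random replication cost:        ", random_cost)
print("socially aware replication cost:", social_cost)
```

Under this cost model the greedy choice picks, for each user, the replica
location that serves the largest number of that user's friends locally, so the
socially aware total can never exceed the random one for the same primary
partition. How large the gap is depends on the graph and on the partition.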
