Chapter 2
Social Locality in Data Storage
The locality property in data storage can be interpreted in different ways. In
Cassandra, a column family is a group of columns that are frequently accessed
together, e.g., name, address, phone number, and email address. These
columns therefore share the same row key, so they are stored on the
same machine. Data locality of this kind is content-based. By social locality, we are
referring to data that are accessed by users who share some social relationship.
Although these data may be unrelated in content, they are frequently
queried together in an online social network and should therefore be stored in close
proximity on disk. Another way to look at locality is in terms of geography. It may
be desirable to store on the same server the data of users who reside in the
same geographic region (think of Akamai, for example).
Although content- and geography-based locality have been taken into account in
the data storage literature, social locality is an emerging concept. Cassandra does
not achieve social locality, because it uses randomized consistent hashing to assign
data to servers. To illustrate the benefit of social locality, consider the 10-node social
graph in Fig. 2.1a, stored across three servers, A, B, and C, in a distributed manner.
Suppose that using Cassandra partitioning we have the partition shown in Fig. 2.1b.
We allow one replica for each user in addition to the primary copy. Figure 2.1c
shows the result of using a random replication algorithm, such as the Rack Unaware
strategy of Cassandra, on top of this partition; this algorithm essentially places
the replicas at random among the three servers. In contrast, Fig. 2.1d shows the result
of running a (hypothetical) replication algorithm that preserves social locality, trying
to place the data of neighboring nodes on the same server as much as possible. Table 2.1
summarizes the cost to read the data for each of the ten users, showing a noticeable
improvement of socially aware replication over random replication (24% better).
This example, despite its simplicity, illustrates the importance of social locality in
distributed data storage.
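
The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 6-user graph, the primary assignment, the two replica-placement rules (`ring_replicas` and `social_replicas`), and the cost model (one remote fetch for every neighbor whose data is absent from the reader's primary server) are all hypothetical. They are not the graph of Fig. 2.1, the cost model behind Table 2.1, or Cassandra's actual algorithms.

```python
# Toy comparison of two replica-placement rules over the same primary partition.
# Everything here (graph, placements, cost model) is an illustrative assumption.

SERVERS = ["A", "B", "C"]

# Hypothetical friendship graph: two triangles joined by the edge c-d.
FRIENDS = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}

# Fixed primary assignment, standing in for a hash-based partition.
PRIMARY = {"a": "A", "b": "B", "c": "C", "d": "A", "e": "B", "f": "C"}


def ring_replicas(primary, servers):
    """Locality-oblivious stand-in: the replica goes to the next server in the list."""
    return {
        user: servers[(servers.index(home) + 1) % len(servers)]
        for user, home in primary.items()
    }


def social_replicas(friends, primary, servers):
    """Greedy heuristic: put each user's replica on the server (other than the
    user's own primary) that hosts the most primaries of the user's neighbors.
    Ties go to the earliest server in `servers`."""
    replica = {}
    for user, neighbors in friends.items():
        candidates = [s for s in servers if s != primary[user]]
        replica[user] = max(
            candidates,
            key=lambda s: sum(primary[v] == s for v in neighbors),
        )
    return replica


def read_cost(friends, primary, replica):
    """Assumed cost model: a read for `user` is served at the user's primary
    server and pays one remote fetch per neighbor with no copy on that server."""
    cost = 0
    for user, neighbors in friends.items():
        home = primary[user]
        for v in neighbors:
            if home not in (primary[v], replica[v]):
                cost += 1
    return cost


ring_cost = read_cost(FRIENDS, PRIMARY, ring_replicas(PRIMARY, SERVERS))
social_cost = read_cost(FRIENDS, PRIMARY, social_replicas(FRIENDS, PRIMARY, SERVERS))
print("oblivious:", ring_cost, "social:", social_cost)
```

On this toy graph the greedy rule lowers the number of remote fetches from 7 to 6. The sketch ignores load balance, which a practical socially aware scheme would also have to take into account.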