Database Reference
In-Depth Information
Partial denormalization
Our initial approach to home timelines, which used the existing, fully normalized data
structure that we've already built, is technically viable but will perform very poorly at scale.
If I follow F users and want a page of size P for my home timeline, Cassandra will need to
do the following:
• Query F partitions for P rows each
• Perform an ordered merge of F x P rows in order to retrieve only the most recent P
The most distressing part of this is the fact that both operations grow in complexity propor-
tionally with the number of people I follow. Let's start by trying to fix this.
The basic goal of the home timeline is to show me the most recent status updates that mat-
ter to me. Instead of doing all the work to find out what status updates matter to me, based
on who I follow, at read time, let's shift some of the work to write time.
I'll create a table that stores references to status updates that I care about. Whenever
someone I follow creates a new status update, I'll add a reference to that update to my home
timeline list; this way, when I want to view my home timeline, I'll know exactly what status
updates to return.
Let's create our new table:
CREATE TABLE "home_status_update_ids" (
"timeline_username" text,
"status_update_id" timeuuid,
"status_update_username" text,
PRIMARY KEY ("timeline_username", "status_update_id")
)
WITH CLUSTERING ORDER BY ("status_update_id" DESC);
Note that there's something subtle going on here with the primary key. In the
user_status_updates table, the username column of the author is the partition
key, and the timestamp UUID, id , is the clustering column. In order to retrieve a specific
row from this table, we need to provide both the username and the id columns.
However, our home_status_update_ids table does not include the author's user-
name in the primary key: the status_update_username column is just a normal data
Search WWH ::




Custom Search