Database Reference
In-Depth Information
Answering the right question
Recall that the job of this table is to be able to answer questions such as, "How many total
status update views happened on each day of September?". In order to do so, we need to
store view counts aggregated daily, and we need to be able to query for ranges of days.
By forming a primary key out of the year, month, and day on which the views were coun-
ted, we aggregate at the right granularity: there will be one row per day, containing the
counts for that day. We use the year of the observation as the partition key, thus giving
ourselves the ability to retrieve a series of daily counts for any range of days within the
same year. In the event that we want to retrieve a series of daily counts for a range that
spans multiple years, our application will need to make multiple queries—one for each year
we're interested in— and then stitch together the results on the client side.
We use a timestamp column as the clustering column, storing the day we're counting views
on. The information in the year partition key is redundant with the year component of the
clustering column, so we'll never need to read it directly, but year-based partitions still al-
low us to keep the size of each partition reasonable and bounded.
For instance, if we want to retrieve total view counts for each day over the month of
September, it's as simple as this:
SELECT "date", "total_views"
FROM "daily_status_update_views"
WHERE "year" = 2014
AND "date" >= '2014-09-01'
AND "date" < '2014-09-30';
Another fact about the schema that's worth noting is the use of four data columns: a total
count, and three columns for client-specific counts. Another way to model the same in-
formation would have been to add another clustering column called, say, client_type ;
then, for each day, we'd have four rows—one per client type. One approach is not obvi-
ously better than the other; in fact, as we'll see in the Appendix A , Peeking Under the
Hood , the two schemas are more similar than we might imagine.
An advantage of the approach we took is that there's exactly one row per day, which is easi-
er to reason about and more naturally fits our intuitions about the data. An advantage of the
one-row-per-client-type approach is that we can introduce new client types without having
to add additional columns to the schema. If we expect that new client types will be intro-
Search WWH ::




Custom Search