Aggregating Time-Series Data - Learning Apache Cassandra

Database Reference

In-Depth Information

Answering the right question

Recall that the job of this table is to be able to answer questions such as, "How many total

status update views happened on each day of September?". In order to do so, we need to

store view counts aggregated daily, and we need to be able to query for ranges of days.

By forming a primary key out of the year, month, and day on which the views were coun-

ted, we aggregate at the right granularity: there will be one row per day, containing the

counts for that day. We use the year of the observation as the partition key, thus giving

ourselves the ability to retrieve a series of daily counts for any range of days within the

same year. In the event that we want to retrieve a series of daily counts for a range that

spans multiple years, our application will need to make multiple queries—one for each year

we're interested in— and then stitch together the results on the client side.

We use a timestamp column as the clustering column, storing the day we're counting views

on. The information in the year partition key is redundant with the year component of the

clustering column, so we'll never need to read it directly, but year-based partitions still al-

low us to keep the size of each partition reasonable and bounded.

For instance, if we want to retrieve total view counts for each day over the month of

September, it's as simple as this:

SELECT "date", "total_views"

FROM "daily_status_update_views"

WHERE "year" = 2014

AND "date" >= '2014-09-01'

AND "date" < '2014-09-30';

Another fact about the schema that's worth noting is the use of four data columns: a total

count, and three columns for client-specific counts. Another way to model the same in-

formation would have been to add another clustering column called, say, client_type ;

then, for each day, we'd have four rows—one per client type. One approach is not obvi-

ously better than the other; in fact, as we'll see in the Appendix A , Peeking Under the

Hood , the two schemas are more similar than we might imagine.

An advantage of the approach we took is that there's exactly one row per day, which is easi-

er to reason about and more naturally fits our intuitions about the data. An advantage of the

one-row-per-client-type approach is that we can introduce new client types without having

to add additional columns to the schema. If we expect that new client types will be intro-

Search WWH ::

Custom Search

Home