Aggregating Time-Series Data - Learning Apache Cassandra

Summary

In this chapter, we explored strategies for aggregating observed time-series data, in this
case user behavior in viewing status updates in our application. While user behavior
analytics are a fantastic and common use case for Cassandra, we could also take the same
approach to aggregate scientific data, economic data, or anything else where we'd like to
roll up discrete observations into high-level aggregate values.

Our structure for recording time-series data used two tables. The first contains the
discrete observations themselves; it serves as the raw material and as the authoritative
record in case we want to introduce new aggregate dimensions down the line. The second
holds aggregate observations precomputed by day. By keeping the aggregate up to date at
write time, we built a structure that lets us retrieve aggregates over a given time period
very efficiently, without any expensive computation at read time. We can easily imagine
constructing dozens of such tables, one for each level of granularity at which we would
like to analyze aggregate information.
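
As a rough sketch of the raw-observation half of this structure, the table below stores one
row per view of a status update. The table name, column names, and sample values are
illustrative assumptions, not necessarily the exact schema used in the chapter.

```cql
-- Hypothetical schema: one row per observed view of a status update.
-- Rows for the same status update share a partition and are ordered by time.
CREATE TABLE status_update_views (
  status_update_username text,
  status_update_id timeuuid,
  observed_at timeuuid,
  client_type text,
  PRIMARY KEY ((status_update_username, status_update_id), observed_at)
);

-- Record a single observation; now() generates a timeuuid for the current time.
INSERT INTO status_update_views
  (status_update_username, status_update_id, observed_at, client_type)
VALUES
  ('alice', 76e7a4d0-e796-11e3-90ce-5f98e903bf02, now(), 'web');
```

Because every observation is retained, a new aggregate table could later be backfilled by
reading these rows.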

We explored using counter columns to maintain the precomputed aggregates with little effort.
Each time we made an observation, we issued an upsert to increment the relevant counter
columns. This allowed us to record observations with nothing more than a series of UPDATE
statements, without having to read the current aggregate values from Cassandra first.
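
The write-time aggregation described above might look something like the following. This
is a minimal sketch: the table daily_status_update_views, its columns, and the sample values
are assumed names chosen for illustration.

```cql
-- Hypothetical aggregate table: one row per status update per day.
-- Every column outside the primary key is a counter.
CREATE TABLE daily_status_update_views (
  status_update_username text,
  status_update_id timeuuid,
  view_date timestamp,
  total_views counter,
  web_views counter,
  PRIMARY KEY ((status_update_username, status_update_id), view_date)
);

-- Record one web view. The row is created if it does not exist yet,
-- and no read of the current counts is needed beforehand.
UPDATE daily_status_update_views
SET total_views = total_views + 1,
    web_views = web_views + 1
WHERE status_update_username = 'alice'
  AND status_update_id = 76e7a4d0-e796-11e3-90ce-5f98e903bf02
  AND view_date = '2014-05-29+0000';

-- Retrieve the precomputed daily aggregates for a time period.
SELECT view_date, total_views, web_views
FROM daily_status_update_views
WHERE status_update_username = 'alice'
  AND status_update_id = 76e7a4d0-e796-11e3-90ce-5f98e903bf02
  AND view_date >= '2014-05-01+0000'
  AND view_date <= '2014-05-31+0000';
```

Using the day as a clustering column keeps each status update's daily rows in one partition,
so the range query reads a contiguous slice rather than computing anything at read time.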

We saw that, while counter columns are a useful tool for precomputed data aggregation,
they also have their downsides. Counter columns do not allow us to set values directly; we
can only increment or decrement them. Because deleting a counter column value is permanent,
deletion is of little use in a counter column table. We also saw that counter columns can
coexist in a table only with other counter columns: apart from the primary key columns,
they can't share a table with other data columns or collection columns.
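
These restrictions can be illustrated with a small counter table. The table and column
names here are hypothetical, and the statements that Cassandra would reject are left
commented out.

```cql
-- Hypothetical counter table used only for this illustration.
CREATE TABLE status_update_view_totals (
  status_update_id timeuuid PRIMARY KEY,
  views counter
);

-- Allowed: increment or decrement relative to the current value.
UPDATE status_update_view_totals SET views = views + 1
WHERE status_update_id = 76e7a4d0-e796-11e3-90ce-5f98e903bf02;

UPDATE status_update_view_totals SET views = views - 1
WHERE status_update_id = 76e7a4d0-e796-11e3-90ce-5f98e903bf02;

-- Rejected: a counter cannot be assigned an absolute value,
-- and INSERT is not supported on counter tables.
-- UPDATE status_update_view_totals SET views = 10
-- WHERE status_update_id = 76e7a4d0-e796-11e3-90ce-5f98e903bf02;
-- INSERT INTO status_update_view_totals (status_update_id, views)
-- VALUES (76e7a4d0-e796-11e3-90ce-5f98e903bf02, 10);

-- Rejected: counter and non-counter data columns cannot share a table.
-- CREATE TABLE mixed_view_totals (
--   status_update_id timeuuid PRIMARY KEY,
--   body text,
--   views counter
-- );
```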

In the next chapter, we will look more deeply into how Cassandra stores and retrieves data,
with particular focus on how data is distributed among multiple machines in a multinode
cluster, the typical configuration of a production Cassandra deployment. You'll learn how
Cassandra handles conflicting updates to the same piece of data using timestamps, and
you'll see how we can override those timestamps to interesting effect. You'll also learn
more about what happens when data is deleted from Cassandra, and use that knowledge to
avoid common pitfalls with data deletion.
