Aggregating Time-Series Data - Learning Apache Cassandra

Precomputation versus read-time aggregation

Now we have a table that's optimized to aggregate our analytics data in a specific way. The
advantage of this approach is that we can now answer questions about daily views very
efficiently; the downside is that we need to create and maintain a table just to aggregate the
data in this one way. Each time we record a view, we need to update both the
status_update_views and daily_status_update_views tables. We can easily imagine dozens of
different ways in which we might want to aggregate analytics data, each requiring its own
purpose-built table, and each needing to be updated whenever an observation is made.
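
The double write described above can be sketched in a few lines of Python. This is only an in-memory illustration, not Cassandra client code: the two containers stand in for the status_update_views and daily_status_update_views tables, and their layout, the function names record_view and daily_views, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone

# In-memory stand-ins for the two tables named in the text. The column
# layout is an assumption made for illustration only.
status_update_views = []                      # one entry per observed view
daily_status_update_views = defaultdict(int)  # (status_update_id, day) -> view count


def record_view(status_update_id, observed_at):
    """Record one view: store the raw observation and bump the daily aggregate."""
    # Write 1: the raw observation.
    status_update_views.append((status_update_id, observed_at))
    # Write 2: the precomputed per-day counter.
    day = observed_at.date()
    daily_status_update_views[(status_update_id, day)] += 1


def daily_views(status_update_id, day):
    """Answer "how many views on this day?" with a single keyed lookup."""
    return daily_status_update_views.get((status_update_id, day), 0)


# Hypothetical sample data: three views of one status update over two days.
record_view("alice-1", datetime(2014, 6, 1, 9, 30, tzinfo=timezone.utc))
record_view("alice-1", datetime(2014, 6, 1, 17, 5, tzinfo=timezone.utc))
record_view("alice-1", datetime(2014, 6, 2, 8, 0, tzinfo=timezone.utc))
```

Every additional aggregate we want to serve this way would add one more write inside record_view, which is exactly the maintenance cost the paragraph above describes; in exchange, each read stays a single lookup by key.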

Cassandra is well suited to this sort of precomputed aggregation because of properties
that we have explored in previous chapters. Because we can horizontally scale our dataset by
simply adding more machines to our cluster, storing many different aggregates of the same
underlying observations isn't hugely expensive. And because Cassandra is extremely efficient
at writing data, it isn't a deal-breaker to have to update several different tables when we
make an observation. The tradeoff is analogous to the one we considered in our exploration of
data denormalization in Chapter 6, Denormalizing Data for Maximum Performance: by keeping
multiple views of the same underlying data, we increase the complexity of writing data, but
give ourselves very efficient structures from which to read that data.

An alternative to precomputation is aggregating data at read time. SQL databases give us a
substantial toolset for performing ad hoc aggregation of data when reading it back, using
aggregate functions like SUM, AVG, MIN, MAX, and so on, combined with a GROUP BY clause to
perform the aggregations at the desired level of granularity. CQL does not offer built-in
aggregation functionality; however, Cassandra is capable of integrating with the Hadoop
MapReduce framework, which provides an efficient means of performing aggregate computations
over massive datasets. DataStax Enterprise, a commercial package that bundles Cassandra,
Hadoop, and the Solr search engine, is worth exploring if you need to aggregate data in an
ad hoc way.
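
To make the contrast with precomputation concrete, here is a minimal Python sketch of read-time aggregation in the map-then-reduce style. It is not Hadoop or Cassandra code: the observations list, the map_to_day and views_per_day names, and the sample data are hypothetical, and the result is roughly what a SQL COUNT combined with GROUP BY would produce over the raw rows.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw observations as (status_update_id, observed_at) pairs.
# Only the raw data is stored; no aggregate is maintained at write time.
observations = [
    ("alice-1", datetime(2014, 6, 1, 9, 30)),
    ("alice-1", datetime(2014, 6, 1, 17, 5)),
    ("alice-1", datetime(2014, 6, 2, 8, 0)),
    ("bob-7", datetime(2014, 6, 1, 12, 0)),
]


def map_to_day(observation):
    """Map step: turn one observation into its grouping key."""
    status_update_id, observed_at = observation
    return (status_update_id, observed_at.date())


def views_per_day(rows):
    """Reduce step: count how many observations share each key."""
    return Counter(map_to_day(row) for row in rows)


daily_counts = views_per_day(observations)
```

Writes stay simple here, but every query has to scan all the raw observations, which is why this style of aggregation is usually handed to a batch framework once the dataset grows large.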
