Precomputation versus read-time aggregation
We now have a table that is optimized to aggregate our analytics data in one specific way. The
advantage of this approach is that we can answer questions about daily views very efficiently;
the downside is that we need to create and maintain a table just to aggregate the data in this
one way. Each time we record a view, we need to update both the status_update_views and
daily_status_update_views tables. We can easily imagine dozens of different ways in which we
might want to aggregate analytics data, each requiring its own purpose-built table, and each
needing to be updated whenever an observation is made.
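To make the double write concrete, here is a minimal sketch in Python. It uses plain in-memory structures as stand-ins for the two tables rather than a real Cassandra client, and the function names and the (username, status_update_id, day) key are assumptions made for illustration, not the schema defined earlier.

```python
from collections import Counter
from datetime import datetime

# In-memory stand-ins for the two tables (assumption: no real Cassandra client).
status_update_views = []               # one entry per observed view
daily_status_update_views = Counter()  # (username, status_update_id, day) -> views


def record_view(username, status_update_id, observed_at):
    """Record one view by writing to both the raw table and the daily aggregate."""
    # Write 1: keep the raw observation.
    status_update_views.append((username, status_update_id, observed_at))
    # Write 2: increment the precomputed counter for the day of the view.
    day = observed_at.date()
    daily_status_update_views[(username, status_update_id, day)] += 1


def daily_views(username, status_update_id, day):
    """Answer the daily-views question with a single keyed lookup."""
    return daily_status_update_views[(username, status_update_id, day)]
```

Every additional aggregate we wanted to maintain would add one more write to record_view, which is exactly the maintenance cost described above; in exchange, each read stays a single lookup.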
Cassandra is well suited to this sort of precomputed aggregation because of properties
that we have explored in previous chapters. Because we can scale our dataset horizontally
simply by adding more machines to our cluster, storing many different aggregates of the same
underlying observations isn't hugely expensive. And because Cassandra is extremely efficient
at writing data, having to update several different tables when we make an observation isn't
a deal-breaker. The tradeoff is analogous to the one we considered in our exploration of data
denormalization in Chapter 6, Denormalizing Data for Maximum Performance: by keeping
multiple views of the same underlying data, we increase the complexity of writing data, but
give ourselves very efficient structures from which to read that data.
An alternative to precomputation is aggregating data at read time. SQL databases give us a
substantial toolset for performing ad hoc aggregation of data when reading it back, using
aggregate functions like SUM, AVG, MIN, MAX, and so on, combined with a GROUP BY clause to
perform the aggregations at the desired level of granularity. CQL does not offer built-in
aggregation functionality; however, Cassandra is capable of integrating with the Hadoop
MapReduce framework, which provides an efficient means of performing aggregate computations
over massive datasets. DataStax Enterprise, a commercial package that bundles Cassandra,
Hadoop, and the Solr search engine, is worth exploring if you need to aggregate data in an
ad hoc way.
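For contrast, read-time aggregation can be sketched as follows. This is only an illustration in Python of what a COUNT combined with GROUP BY does conceptually: it scans raw observations and groups them on demand. The tuple layout and the function name views_per_day are assumptions for the example, and a real deployment would run this kind of computation in a framework such as Hadoop rather than in application memory.

```python
from collections import defaultdict
from datetime import datetime


def views_per_day(observations):
    """Count views per (username, status_update_id, day) by scanning raw rows.

    Each observation is assumed to be a (username, status_update_id,
    observed_at) tuple, where observed_at is a datetime.
    """
    counts = defaultdict(int)
    for username, status_update_id, observed_at in observations:
        # Grouping happens here, at read time, for every query.
        counts[(username, status_update_id, observed_at.date())] += 1
    return dict(counts)
```

Nothing extra has to be written when a view is recorded, but every query pays the cost of reading all of the relevant observations, which is the reverse of the precomputed approach.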