In particular, the scenario we'll explore here again uses data from a web server's access logs.
Using this data, we'll pre-calculate reports on the number of hits to a collection of websites
at various levels of granularity based on time (i.e., by minute, hour, day, week, and month) as
well as by the path of a resource.
To achieve the required performance to support these tasks, we'll use MongoDB's upsert and increment operations to calculate statistics, allowing simple range-based queries to quickly return data to support time-series charts of aggregated data.
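As a concrete sketch of the upsert-and-increment pattern, the function below builds the filter and update documents for a per-minute hit counter. The collection name, `_id` layout, and field names are illustrative assumptions, not the schema this chapter ultimately settles on.

```python
from datetime import datetime

def minute_upsert(site, path, ts):
    """Build filter/update documents for an upserted per-minute
    hit counter (illustrative field names and _id layout)."""
    minute = ts.replace(second=0, microsecond=0)
    query = {"_id": f"{site}{path}/{minute.isoformat()}"}
    update = {
        "$inc": {"hits": 1},  # atomic counter increment; creates the field on first upsert
        "$setOnInsert": {"site": site, "path": path, "ts": minute},
    }
    return query, update

# With a live connection (e.g., via pymongo) this would run as:
#   db.stats.minute.update_one(query, update, upsert=True)
```

Because `$inc` is atomic on the server, many web servers can record hits concurrently without a read-modify-write race, and the upsert creates the counter document on first use.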
Schema Design
Schemas for real-time analytics systems must support simple and fast query and update operations. In particular, we need to avoid the following performance killers:
Individual documents growing significantly after they are created
Document growth forces MongoDB to move the document on disk, slowing things down.
Collection scans
The more documents that MongoDB has to examine to fulfill a query, the less efficient
that query will be.
Documents with a large number (hundreds) of keys
Due to the way BSON, MongoDB's internal document storage format, lays out keys and values, this can create wide variability in access time to particular values.
Intuitively, you may consider keeping "hit counts" in individual documents, with one document for every unit of time (minute, hour, day, etc.). However, any nontrivial time-range query would then need to visit multiple documents, which can slow overall query performance.
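To quantify why the one-document-per-time-unit layout hurts, a quick back-of-the-envelope calculation shows how many documents a range query would have to visit (the function is illustrative, not part of any schema here):

```python
from datetime import datetime, timedelta

def docs_scanned(start, end):
    """Documents a range query must visit under a hypothetical
    one-document-per-minute design: one per elapsed minute."""
    return int((end - start).total_seconds() // 60)

# A single day's chart for one path already touches 1,440 documents;
# a month-long chart would touch over 43,000.
one_day = docs_scanned(datetime(2013, 1, 1), datetime(2013, 1, 2))
```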
A better solution is to store a number of aggregate values in a single document, reducing the number of documents the query engine must examine to return its results. The remainder of this section explores several schema designs that you might consider for this real-time analytics system, before finally settling on one that achieves both good update performance and good query performance.
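One way to sketch this aggregated approach: keep one document per (path, hour) and nest the per-minute counters inside it, updating both levels with a single `$inc` using dot notation. The schema and names below are an assumed illustration of the idea, not the final design.

```python
from datetime import datetime

def hourly_update(site, path, ts):
    """Build an upsert targeting one document per (path, hour),
    with per-minute counters nested inside it (illustrative schema)."""
    hour = ts.replace(minute=0, second=0, microsecond=0)
    query = {"_id": f"{site}{path}/{hour.isoformat()}"}
    update = {
        "$inc": {
            "hourly": 1,                # hour-level total
            f"minute.{ts.minute}": 1,   # dot notation increments one nested key
        }
    }
    return query, update

# A whole hour's chart now reads a single document instead of 60:
#   db.stats.hourly.find_one({"_id": query["_id"]})
```

Both counters are incremented atomically in one server-side operation, and a time-series chart for an hour becomes a single-document fetch rather than a 60-document scan.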