Real-Time Unique Visitor Pivot Tables
In Chapter 6, “Storing Streaming Data,” a compressed bitmap was used
to efficiently store the data needed to create pivot tables over the
population of page views, clicks, and so on for a website. Event counts
are useful, but it is often more interesting to talk about the audience
of a website than about its page views.
With compressed bitmaps, tracking unique users was not possible
because adding an element to the bitmap is an append-only operation,
which requires a monotonically increasing counter to serve as the
index. In the original example, the order of arrival was chosen as this
counter. Unique visitors, in contrast, do not appear in any particular
order and will likely appear multiple times over a given time period.
HyperLogLog sketches can be used in place of the compressed bitmap,
with the inclusion-exclusion principle used to estimate the
intersections required by the pivot table. To begin, recall the input
data for a page view on the website:
timestamp,user id,feature1:value,feature2:value,...,featureN:value
Each input record contains a user ID, which will be used as the element
to enter into the HyperLogLog sketch. It also includes some number of
feature elements that define demographic information about the user
or other information about the page they are visiting (for example, the
site section or the product page).
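
As a minimal sketch of reading this record format, the following
Python function splits one line into its timestamp, user ID, and
feature pairs. The function name parse_page_view is hypothetical, and
the code assumes that fields are separated by commas, that each feature
is written as name:value, and that neither delimiter appears inside a
value. The timestamp is left as a string because the text does not
specify its format.

```python
def parse_page_view(line):
    """Split one page-view record into (timestamp, user_id, features).

    Assumed layout: timestamp,user id,feature1:value,...,featureN:value
    with no escaped commas or colons inside the values.
    """
    timestamp, user_id, *fields = line.rstrip("\n").split(",")
    features = []
    for field in fields:
        # partition splits on the first colon only, so a value that itself
        # contains a colon stays intact.
        name, _, value = field.partition(":")
        features.append((name, value))
    return timestamp, user_id, features


if __name__ == "__main__":
    # Hypothetical record, made up for illustration.
    print(parse_page_view("1400000000,user42,section:sports,plan:paid"))
```

The user ID returned here is the element that would be added to each
sketch, and each (name, value) pair identifies which sketch to update.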
Essentially, a HyperLogLog sketch needs to be maintained for each
feature:value combination, from which the approximate number of unique
user IDs associated with that combination can be estimated. To build a
pivot table for any two features A and B, retrieve the sketches for all
possible values of feature A (featureA:value1, featureA:value2, …,
featureA:valueN) and the sketches for all possible values of feature B
(featureB:value1, featureB:value2, …, featureB:valueN). Then
compute the intersection of each featureA:valueX, featureB:valueY
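
The per-feature sketches and the inclusion-exclusion estimate,
|A ∩ B| = |A| + |B| - |A ∪ B|, can be illustrated with a small,
self-contained Python sketch. It is not the book's implementation: the
HyperLogLog class is a simplified version written for this example
(SHA-1 as the hash, 2^14 registers, only the small-range correction),
and the names update, intersection_estimate, and pivot_table are
hypothetical. A production system would more likely use an existing
HyperLogLog library.

```python
import hashlib
import math
from collections import defaultdict


class HyperLogLog:
    """Simplified HyperLogLog sketch; illustrative, not production-tuned."""

    def __init__(self, p=14):
        self.p = p                       # number of index bits (assumed default)
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")           # 64-bit hash value
        idx = h >> (64 - self.p)                        # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1    # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other):
        # The union of two sketches is the register-wise maximum.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)           # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:               # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw


def update(sketches, user_id, features):
    """Add the user ID to the sketch of every (feature, value) pair seen."""
    for name, value in features:
        sketches[(name, value)].add(user_id)


def intersection_estimate(a, b):
    """Inclusion-exclusion: |A n B| = |A| + |B| - |A u B|, clamped at zero."""
    return max(0.0, a.count() + b.count() - a.union(b).count())


def pivot_table(sketches, feature_a, feature_b):
    """Estimated unique users for every (value of A, value of B) cell."""
    rows = [key for key in sketches if key[0] == feature_a]
    cols = [key for key in sketches if key[0] == feature_b]
    return {(r[1], c[1]): intersection_estimate(sketches[r], sketches[c])
            for r in rows for c in cols}


if __name__ == "__main__":
    sketches = defaultdict(HyperLogLog)
    # Hypothetical page views; the repeat visit by u1 is counted only once.
    update(sketches, "u1", [("section", "sports"), ("plan", "paid")])
    update(sketches, "u1", [("section", "sports"), ("plan", "paid")])
    update(sketches, "u2", [("section", "news"), ("plan", "paid")])
    print(pivot_table(sketches, "section", "plan"))
```

Because each term in the inclusion-exclusion sum carries its own
estimation error, the intersection estimate is least reliable when the
true overlap is small relative to the two sets, which is why the
result is clamped at zero here.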