Approximating Streaming Data with Sketching - Real-Time Analytics

Real-Time Unique Visitor Pivot Tables

In Chapter 6, “Storing Streaming Data,” a compressed bitmap was used
to efficiently store the data needed to create pivot tables over the
population of page views, clicks, and so on for a website. This is
somewhat interesting, but it is often more interesting to talk about
the audience of a website than about its page views.

With compressed bitmaps, tracking unique users wasn't possible because
adding new elements to the bitmap is an append-only operation, which
requires a monotonically increasing counter to serve as the index. In
the original example, the order of arrival was chosen as this counter.
In contrast, the unique visitors over some time period do not arrive in
any particular order and will likely appear multiple times over that
time period.

Instead, HyperLogLog sketches can be used in place of the compressed
bitmap, with the inclusion-exclusion principle used to estimate the
intersections required by the pivot table. To begin, recall the input
data for a page view on the website:

timestamp,user id,feature1:value,feature2:value,...,featureN:value

Each input record contains a user ID, which will be used as the element
to enter into the HyperLogLog sketch. It also includes some number of
feature elements that define demographic information about the user or
other information about the page they are visiting (for example, the
site section or the product page).
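As a rough illustration of the record layout described above, the
following sketch splits one input line into its parts. The function
name parse_record is hypothetical, and the code assumes that fields are
separated by plain commas with no quoting or escaping, that the user ID
contains no comma, and that each feature is separated from its value by
the first colon in the field. The timestamp is kept as an uninterpreted
string because its format is not specified here.

```python
def parse_record(line):
    """Split one page-view record into (timestamp, user_id, features).

    Assumed layout: timestamp,user id,feature1:value,...,featureN:value
    """
    fields = line.strip().split(",")
    if len(fields) < 2:
        raise ValueError("expected at least a timestamp and a user ID")
    timestamp, user_id = fields[0], fields[1]
    features = {}
    for field in fields[2:]:
        # Split on the first colon only, so a value may itself contain colons.
        name, sep, value = field.partition(":")
        if not sep:
            raise ValueError(f"malformed feature field: {field!r}")
        features[name] = value
    return timestamp, user_id, features
```

Only the user ID and the feature dictionary are needed to update the
sketches; the timestamp would matter mainly for deciding which time
period a record belongs to.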

Essentially, a HyperLogLog sketch needs to be maintained for each
feature:value combination, each one summarizing the set of unique user
IDs associated with that combination. To build a pivot table for any
two features A and B, simply retrieve the sketches for all the possible
values of feature A (featureA:value1, featureA:value2, …,
featureA:valueN) and all of the sketches for the possible values of
feature B (featureB:value1, featureB:value2, …, featureB:valueN). Then
compute the intersection of each featureA:valueX, featureB:valueY
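
The procedure can be sketched in a few dozen lines. The code below is a
minimal illustration under stated assumptions, not the implementation
used elsewhere in this book: the HyperLogLog class is a simplified
version written for this example (SHA-1 truncated to 64 bits as the
hash, 2^12 registers, and only the small-range linear-counting
correction), and the names build_sketches and pivot are hypothetical.
Each cell of the pivot table is estimated with inclusion-exclusion,
|A ∩ B| = |A| + |B| - |A ∪ B|, where the union is obtained by taking the
register-wise maximum of the two sketches.

```python
import hashlib
import math
from collections import defaultdict


class HyperLogLog:
    """Simplified HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Use the first 64 bits of a SHA-1 digest as the element's hash.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def union(self, other):
        # Register-wise maximum; assumes both sketches use the same p.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)         # constant for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


def build_sketches(records):
    """Maintain one sketch per (feature, value) pair.

    records: iterable of (user_id, {feature: value}) pairs.
    """
    sketches = defaultdict(HyperLogLog)
    for user_id, features in records:
        for feature, value in features.items():
            sketches[(feature, value)].add(user_id)
    return sketches


def pivot(sketches, feature_a, feature_b):
    """Estimate unique users for every (value of A, value of B) cell."""
    a_keys = [key for key in sketches if key[0] == feature_a]
    b_keys = [key for key in sketches if key[0] == feature_b]
    table = {}
    for ka in a_keys:
        for kb in b_keys:
            a, b = sketches[ka], sketches[kb]
            # Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
            estimate = a.count() + b.count() - a.union(b).count()
            table[(ka[1], kb[1])] = max(0.0, estimate)  # clamp noise below zero
    return table
```

Calling pivot(sketches, "section", "device") on sketches built from
records carrying those two (made-up) features returns a dictionary
keyed by (section value, device value). Because each cell is a
difference of three estimates, the errors of all three contribute to
it, so cells whose true intersection is small relative to the two sets
involved can be quite inaccurate in relative terms.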
