Database Reference
in Brooklyn on a Sunday night. To make things more difficult, the data is
encoded as human language.
Another reason is that there is a “kitchen sink” mentality to streaming
data projects. Most of the projects are fairly young and exploring unknown
territory, so it makes sense to toss as many different dimensions into the
data as possible. The set of dimensions is likely to change over time, so these
projects also tend to choose a format that is easy to modify, such as JavaScript Object
Notation (JSON). The general approach is to collect as much data as
possible in case some of it turns out to be interesting.
Finally, the real-time nature of the data collection also means that various
dimensions may or may not be available at any given time. For example,
a service that converts IP addresses to a geographical location may be
temporarily unavailable. For a batch system this does not present a
problem; the analysis can always be redone later when the service is once
more available. The streaming system, on the other hand, must be able to
deal with changes in the available dimensions and do the best it can.
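As a minimal sketch of "doing the best it can", the following Python fragment enriches JSON events with a location and keeps going when the lookup fails. The names `enrich`, `flaky_lookup`, and `LookupUnavailable` are hypothetical, and the lookup is a stand-in for a real IP-to-location service, not any particular API.

```python
import json


class LookupUnavailable(Exception):
    """Raised by the hypothetical geo service when it cannot answer."""


def enrich(raw_line, geo_lookup):
    """Parse one JSON event and attach a location if the lookup succeeds."""
    event = json.loads(raw_line)
    ip = event.get("ip")
    if ip is None:
        # The event never carried this dimension; pass it through unchanged.
        return event
    try:
        event["geo"] = geo_lookup(ip)
    except LookupUnavailable:
        # Keep the event and record that the dimension is missing right now.
        event["geo"] = None
    return event


def flaky_lookup(ip):
    # Stand-in for a real service: knows one address, fails for the rest.
    table = {"203.0.113.7": "Brooklyn"}
    if ip not in table:
        raise LookupUnavailable(ip)
    return table[ip]


events = [
    '{"ip": "203.0.113.7", "page": "/home"}',
    '{"ip": "198.51.100.2", "page": "/cart"}',
    '{"page": "/about"}',
]
enriched = [enrich(line, flaky_lookup) for line in events]
```

Downstream code then has to treat `geo` as optional, which is the price of not stalling the stream while the service recovers.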
High-Cardinality Storage
Cardinality refers to the number of unique values a piece of data can take
on. Formally, cardinality refers to the size of a set and can be applied to
the various dimensions of a dataset as well as the entire dataset itself. High
cardinality often manifests itself as a “long tail” in the data. For
a given dimension (or combination of dimensions) there is a small set of
different states that are quite common, usually accounting for the majority
of the observed data, and then a “long tail” of other data states that comprise
a fairly small fraction.
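To make the two ideas concrete, here is a small Python sketch that measures the cardinality of one dimension and the share of observations covered by its most common states. The `browsers` data is invented for illustration, and the helper names are assumptions rather than anything from the text.

```python
from collections import Counter


def cardinality(values):
    """Number of distinct states observed for a dimension."""
    return len(set(values))


def head_share(values, k):
    """Fraction of observations covered by the k most common states."""
    if not values:
        return 0.0
    counts = Counter(values)
    top = sum(n for _, n in counts.most_common(k))
    return top / len(values)


# Invented sample: two dominant states, then a tail of rare ones.
browsers = (
    ["chrome"] * 60
    + ["safari"] * 30
    + ["firefox"] * 5
    + ["opera", "lynx", "brave", "edge", "vivaldi"]
)

n_states = cardinality(browsers)   # 8 distinct states
share = head_share(browsers, 2)    # top 2 states cover 90 of 100 rows
```

Here two of the eight states account for 90 percent of the observations, and the remaining six form the tail.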
This feature is common to both streaming and batch systems, but it is much
harder to deal with high cardinality in the streaming setting. In the batch
setting it is usually possible to perform multiple passes over the dataset.
A first pass over the data is often used to identify dimensions with high
cardinality and compute the states that make up most of the data. These
common states can be treated individually, and the remaining states are
combined into a single “other” state that can usually be ignored.
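The two-pass batch approach described above can be sketched roughly as follows. The 90 percent coverage threshold, the function names, and the `cities` data are all assumptions made for the example.

```python
from collections import Counter


def common_states(values, coverage=0.9):
    """First pass: most frequent states that together reach `coverage`."""
    counts = Counter(values)
    total = sum(counts.values())
    keep, covered = set(), 0
    for state, n in counts.most_common():
        if covered >= coverage * total:
            break
        keep.add(state)
        covered += n
    return keep


def collapse_tail(values, keep, other="other"):
    """Second pass: replace every state outside `keep` with one label."""
    return [v if v in keep else other for v in values]


# Invented sample: three common states and eight rare ones.
cities = (
    ["nyc"] * 50
    + ["la"] * 30
    + ["sf"] * 12
    + ["reno", "boise", "tulsa", "fargo", "provo", "macon", "salem", "butte"]
)

keep = common_states(cities)            # pass 1 over the full dataset
collapsed = collapse_tail(cities, keep)  # pass 2 over the same dataset
```

Both functions read the whole dataset, which is exactly what a batch system can afford to do.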
In the streaming setting, the data can usually be processed only once.
If the common cases are known ahead of time, this knowledge can be built into the
processing step. The long tail can also be combined into the “other” state,