Introduction to Streaming Data - Real-Time Analytics


in Brooklyn on a Sunday night. To make things more difficult, the data is

encoded as human language.

Another reason is that there is a “kitchen sink” mentality to streaming

data projects. Most of the projects are fairly young and exploring unknown

territory, so it makes sense to toss as many different dimensions into the

data as possible. The set of dimensions is likely to change over time, so the decision is also

made to use a format that can be easily modified, such as JavaScript Object

Notation (JSON). The general paradigm is to collect as much data as

possible in case it turns out to be interesting.

Finally, the real-time nature of the data collection also means that various

dimensions may or may not be available at any given time. For example,

a service that converts IP addresses to a geographical location may be

temporarily unavailable. For a batch system this does not present a

problem; the analysis can always be redone later when the service is once

more available. The streaming system, on the other hand, must be able to

deal with changes in the available dimensions and do the best it can.
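
One way to picture "doing the best it can" is an enrichment step that attaches the geographical dimension only when the lookup succeeds and otherwise passes the event through unchanged. The sketch below is illustrative only: `enrich_event`, `flaky_lookup`, `working_lookup`, and the `geo_available` flag are hypothetical names, not part of any particular streaming framework.

```python
from typing import Callable, Optional


def enrich_event(event: dict, geo_lookup: Callable[[str], Optional[dict]]) -> dict:
    """Return a copy of the event, adding a 'geo' dimension if the lookup works."""
    enriched = dict(event)
    try:
        geo = geo_lookup(event["ip"])
    except Exception:
        # Assumption: any failure (missing IP, service outage) simply means
        # the dimension is unavailable for this event.
        geo = None
    if geo is not None:
        enriched["geo"] = geo
    # Record availability so downstream steps can tell "unknown" from "absent".
    enriched["geo_available"] = geo is not None
    return enriched


def flaky_lookup(ip: str) -> Optional[dict]:
    # Hypothetical stand-in for a remote service that is temporarily down.
    raise ConnectionError("geo service unavailable")


def working_lookup(ip: str) -> Optional[dict]:
    # Hypothetical stand-in for a healthy service.
    return {"city": "Brooklyn"}


degraded = enrich_event({"ip": "203.0.113.7", "page": "/home"}, flaky_lookup)
complete = enrich_event({"ip": "203.0.113.7", "page": "/home"}, working_lookup)
```

The event is never dropped or delayed when the service is down; it simply arrives downstream with one dimension fewer, which is the trade-off a batch system does not have to make.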

High-Cardinality Storage

Cardinality refers to the number of unique values a piece of data can take

on. Formally, cardinality refers to the size of a set and can be applied to

the various dimensions of a dataset as well as the entire dataset itself. High

cardinality often manifests itself as a “long tail” in the data. For

a given dimension (or combination of dimensions) there is a small set of

different states that are quite common, usually accounting for the majority

of the observed data, and then a “long tail” of other data states that comprise

a fairly small fraction.
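
To make the long tail concrete, the following minimal sketch measures the cardinality of one dimension and the share of observations covered by its most common states. The function name `long_tail_summary` and the sample data are assumptions made for illustration.

```python
from collections import Counter


def long_tail_summary(values, top_k):
    """Summarize how much of the data the top_k most common states cover."""
    counts = Counter(values)
    total = sum(counts.values())
    head = counts.most_common(top_k)
    head_count = sum(count for _, count in head)
    return {
        "cardinality": len(counts),  # number of distinct states observed
        "head_states": [state for state, _ in head],
        "head_share": head_count / total if total else 0.0,
        "tail_share": (total - head_count) / total if total else 0.0,
    }


# Illustrative data: two common states plus 15 states seen once each.
sample = ["US"] * 60 + ["CA"] * 25 + [f"country_{i}" for i in range(15)]
summary = long_tail_summary(sample, top_k=2)
```

In this made-up sample, 2 of the 17 distinct states account for 85 of the 100 observations, and the remaining 15 states form the tail.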

This feature is common to both streaming and batch systems, but it is much

harder to deal with high cardinality in the streaming setting. In the batch

setting it is usually possible to perform multiple passes over the dataset.

A first pass over the data is often used to identify dimensions with high

cardinality and compute the states that make up most of the data. These

common states can be treated individually, and the remaining states are

combined into a single “other” state that can usually be ignored.
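
The two-pass batch approach described above can be sketched as follows. This is one possible reading, not a prescribed method: the 80% coverage threshold, the function names `find_common_states` and `collapse_tail`, and the sample records are all assumptions. The records are held in a list so that they can be scanned twice.

```python
from collections import Counter


def find_common_states(records, dimension, coverage=0.8):
    """First pass: pick the most frequent states until `coverage` is reached."""
    counts = Counter(record[dimension] for record in records)
    total = sum(counts.values())
    common, covered = set(), 0
    for state, count in counts.most_common():
        if total == 0 or covered / total >= coverage:
            break
        common.add(state)
        covered += count
    return common


def collapse_tail(records, dimension, common, other="other"):
    """Second pass: keep common states, map every other state to `other`."""
    for record in records:
        state = record[dimension]
        yield {**record, dimension: state if state in common else other}


# Illustrative records only.
records = (
    [{"browser": "chrome"}] * 6
    + [{"browser": "firefox"}] * 3
    + [{"browser": "lynx"}]
)
common = find_common_states(records, "browser", coverage=0.8)
collapsed = list(collapse_tail(records, "browser", common))
```

The second scan is exactly what a streaming system usually cannot afford, since the first pass must finish before any record can be mapped.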

In the streaming setting, the data can usually be processed only a single time.

If the common cases are known ahead of time, they can be included in the

processing step. The long tail can also be combined into the “other” state,
