environment that has good horizontal scalability—a concept also discussed
in Chapter 2—can even implement auto-scaling. In this setting, as the delay
increases, more processing power is temporarily added to bring the delay
back into acceptable limits.
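As a rough illustration of the auto-scaling idea, the sketch below picks a worker count from the observed processing delay. The function name, the proportional scale-up rule, the scale-down threshold, and the worker limits are all assumptions made for this example; they are not taken from the text or from any particular platform.

```python
import math


def workers_needed(current_workers, delay_s, max_delay_s,
                   min_workers=1, max_workers=64):
    """Suggest a worker count meant to bring delay back under max_delay_s.

    Hypothetical policy: scale up in proportion to how far the delay
    exceeds the limit, and release one worker at a time once the delay
    falls below half the limit.
    """
    if delay_s > max_delay_s:
        # Delay is too high: add capacity proportional to the overshoot.
        target = math.ceil(current_workers * delay_s / max_delay_s)
    elif delay_s < 0.5 * max_delay_s:
        # Plenty of headroom: give back the temporary capacity slowly.
        target = current_workers - 1
    else:
        # Delay is within acceptable limits: leave the pool alone.
        target = current_workers
    # Never go below the minimum or above the maximum pool size.
    return max(min_workers, min(max_workers, target))


if __name__ == "__main__":
    print(workers_needed(current_workers=4, delay_s=10.0, max_delay_s=5.0))
```

Scaling down more slowly than scaling up is one common way to avoid oscillating around the limit, but the right policy depends on the workload.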
On the algorithmic side, this always-flowing feature of streaming data is a
bit of a double-edged sword. On the positive side, there is rarely a situation
where there is not enough data. If more data is required for an analysis,
simply wait for enough data to become available. It may require a long wait,
but other analyses can be conducted in the meantime that can provide early
indicators of how the later analysis might proceed.
On the downside, much of the statistical tooling that has been developed
over the last 80 or so years is focused on the discrete experiment. Many
of the standard approaches to analysis are not necessarily well suited to
the data when it is streaming. For example, “statistical significance”
becomes an odd concept in a streaming context. Many treat it as a kind of
“stopping rule” for collecting data, but it does not actually work like that.
The p-value used to make the significance call is itself a random quantity:
it may dip below the critical value (usually 0.05) at one observation and
rise back above 0.05 when the next value is observed.
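The effect can be seen in a small simulation. The sketch below generates streams in which there is genuinely no effect (standard normal data with mean 0) and recomputes a two-sided z-test p-value after every observation. It then compares how often a stream is ever flagged as significant when the p-value is checked continuously with how often it is flagged when only the final p-value is used. The function names, stream sizes, seed, and the choice of a z-test with known variance are assumptions made for this illustration.

```python
import math
import random
from statistics import NormalDist


def p_value_path(samples):
    """Two-sided z-test p-value (H0: mean 0, known sd 1) after each observation."""
    std_normal = NormalDist()
    total = 0.0
    path = []
    for n, x in enumerate(samples, start=1):
        total += x
        z = total / math.sqrt(n)  # sample mean divided by its standard error
        path.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return path


def false_positive_rates(n_streams=1000, n_obs=200, alpha=0.05,
                         min_obs=10, seed=0):
    """Return (rate when stopping at the first dip below alpha,
               rate when testing once at the end)."""
    rng = random.Random(seed)
    stopped_early = 0
    significant_at_end = 0
    for _ in range(n_streams):
        path = p_value_path(rng.gauss(0.0, 1.0) for _ in range(n_obs))
        # "Stopping rule" misuse: declare significance at the first dip.
        if min(path[min_obs - 1:]) < alpha:
            stopped_early += 1
        # Discrete-experiment use: look only once, at a fixed sample size.
        if path[-1] < alpha:
            significant_at_end += 1
    return stopped_early / n_streams, significant_at_end / n_streams


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"flagged with continuous peeking: {peeking:.3f}")
    print(f"flagged with a single final test: {fixed:.3f}")
```

Because the data contain no real effect, every flagged stream is a false positive. The single final test should flag roughly the nominal 5 percent of streams, while continuous peeking should flag noticeably more.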
This does not mean that statistical techniques cannot and should not be
used—quite the opposite. They still represent the best tools available for
the analysis of noisy data. It is simply that care should be taken when
performing the analysis as the prevailing dogma is mostly focused on
discrete experiments.
Loosely Structured
Streaming data is often loosely structured compared to many other datasets.
There are several reasons this happens, and although this loose structure is
not unique to streaming data, it seems to be more common in streaming
settings than in other situations.
Part of the reason seems to be the type of data that is interesting in the
streaming setting. Streaming data comes from a variety of sources. Although
some of these sources are rigidly structured, many of them are carrying an
arbitrary data payload. Social media streams, in particular, will be carrying
data about everything from world events to the best slice of pizza to be found