Why Streaming Data Is Different
There are a number of aspects to streaming data that set it apart from
other kinds of data. The three most important, covered in this section, are
the “always-on” nature of the data, the loose and changing data structure,
and the challenges presented by high-cardinality dimensions. All three play
a major role in decisions made in the design and implementation of the
various streaming frameworks presented in this book. These features of
streaming data particularly influence the data processing frameworks
presented in Chapter 5. They are also reflected in the design decisions of the
data motion tools, which consciously choose not to impose a data format
on information passing through their system to allow maximum flexibility.
The remainder of this section covers each of these in more depth to provide
some context before diving into Chapter 2, which covers the components
and requirements of a streaming architecture.
Always On, Always Flowing
The first is somewhat obvious: streaming data streams. The data is always
available and new data is always being generated. This has a few effects on
the design of any collection and analysis system. First, the collection itself
needs to be very robust. Downtime for the primary collection system means
that data is permanently lost. This is an important thing to remember when
designing an edge collector, and it is discussed in more detail in Chapter 2.
Second, the fact that the data is always flowing means that the system needs
to be able to keep up with the data. If 2 minutes are required to process 1
minute of data, the system will not be real time for very long. Eventually,
the problem will be so bad that some data will have to be dropped to allow
the system to catch up. In practice it is not enough to have a system that
can merely “keep up” with data in real time. It needs to be able to process
data far more quickly than real time. Sooner or later the system, in whole
or in part, will go down, whether intentionally, as with planned downtime,
or through catastrophic failure, such as a network outage.
Failing to plan for this inevitability, with a system that can only process
events as fast as they happen, means that after an outage the system lags
by however much data accumulated at the collectors while it was down. A
system that can process 1 hour of data in 1 minute, on the other hand, can
catch up fairly quickly with little need for intervention.
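
To make the catch-up arithmetic concrete, here is a minimal sketch in
Python. The function and the numbers are illustrative assumptions, not
part of any framework covered in this book; the key quantity is the
speedup factor, the ratio of the processing rate to the arrival rate.

    def catch_up_seconds(backlog_seconds, speedup):
        """Wall-clock time to drain a backlog when the system processes
        `speedup` seconds of data per wall-clock second while new data
        keeps arriving in real time."""
        if speedup <= 1.0:
            # At or below real-time speed the backlog never shrinks.
            return float("inf")
        # Each wall-clock second removes (speedup - 1) seconds of backlog.
        return backlog_seconds / (speedup - 1.0)

    # A system that processes 1 hour of data in 1 minute (speedup = 60)
    # recovers from a 1-hour outage in about a minute:
    print(catch_up_seconds(3600, 60.0))   # ~61.0 seconds
    # A system needing 2 minutes per minute of data (speedup = 0.5)
    # never catches up:
    print(catch_up_seconds(3600, 0.5))    # inf

The point of the sketch is that headroom, not mere throughput parity,
determines recovery time: at a speedup of exactly 1 the backlog never
shrinks, and every multiple beyond that drains it proportionally faster.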