considered to be mutually exclusive. There is no reason that the two cannot
both be used in a single environment according to need.
Distributed Data Flows
A distributed data flow system must address two fundamental requirements.
The first is an "at least once" delivery semantic. The second
is solving the "n+1" delivery problem. Without these, a distributed data
flow will have difficulty scaling successfully. This section covers these two
components and explains why they are so important to a distributed data flow.
At Least Once Delivery
There are three options for data delivery and processing in any sort of data
collection framework ("at least once" delivery is sketched in code after the list):
• At most once delivery
• At least once delivery
• Exactly once delivery
Many processing frameworks, particularly those used for system
monitoring, provide “at most once” delivery and processing semantics.
Largely, this is because the situations they were designed to handle do
not require that all the data be transmitted, but they do require maximum
performance to alert administrators to problems. In fact, many of these
systems down-sample the data to further improve performance. As long as
the rate of data loss is approximately known, the monitoring software can
recover a usable value during processing.
In other systems, such as financial systems or advertising systems
where logs are used to determine fees, every lost data record means lost
revenue. Furthermore, audit requirements often mean that this data loss
cannot be estimated with the techniques used in the monitoring space. In
this case, most implementations turn to "exactly once" delivery through
queuing systems. Popular examples include Apache ActiveMQ and RabbitMQ,
along with innumerable commercial solutions. These servers typically
implement their queue semantics on the server side, primarily because they
are designed to support a variety of producers and consumers in an
enterprise setting.