The primary drawback of log-based systems is that they are slow: data is only collected in batches when a log file is “rolled” and then processed en masse. Second-generation data-flow systems recognized that reliable transport was not always a priority and began to use remote procedure call (RPC) mechanisms to move data between machines. Although
they may have some buffering to improve reliability, the second-generation
systems, such as Scribe and Avro, generally accept that speed is acquired
at the expense of reliability. For many applications this tradeoff is wholly
acceptable and has been made for decades in systems-monitoring software
such as Syslog, which uses a similar model.
Third-generation systems combine the reliability of the first-generation log
models with the speed of the second-generation RPC models. In these systems, both producers and consumers have a real-time interface to the data layer, which delivers data as discrete messages rather than the bulk delivery found in first-generation log systems. In practice,
this is usually accomplished as a “mini batch” on the order of a thousand
messages to improve performance.
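As a rough illustration of this message-oriented, micro-batched delivery, the following sketch uses Kafka’s Java producer (Kafka is discussed later in this section); the broker address, topic name, and batching values are illustrative assumptions rather than anything prescribed here:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MiniBatchProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Let the client accumulate a small batch of messages before sending,
        // trading a few milliseconds of latency for higher throughput.
        props.put("linger.ms", "5");
        props.put("batch.size", "16384");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // Each event is still sent as a discrete message; batching
                // happens transparently inside the client.
                producer.send(new ProducerRecord<>("events", Integer.toString(i), "event-" + i));
            }
        }
    }
}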
However, these environments also implement an intermediate storage layer that allows them to make the same “at least once” delivery guarantees as log-based delivery systems. To maintain the requisite performance, this storage layer is horizontally scaled across multiple physical machines, with coordination handled by the client software on both the producer and consumer sides of the system.
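On the consumer side, the “at least once” guarantee typically shows up as the client committing its position in the stream only after it has processed the messages it received; if it crashes before the commit, those messages are delivered again on restart. The following is a minimal sketch using Kafka’s Java consumer, with the broker address, topic, and group name assumed for illustration:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "analytics");               // assumed consumer group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        // Commit offsets manually, and only after processing succeeds.
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // application logic goes here
                }
                // If the process dies before this commit, the same records are
                // redelivered on restart: at-least-once semantics.
                consumer.commitSync();
            }
        }
    }

    private static void process(String value) {
        System.out.println(value);
    }
}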
The first of these third-generation systems were queuing systems designed to handle large data loads; ActiveMQ is an example. By providing a queuing paradigm, these systems allow for the development of message “buses” that loosely couple the different components of an architecture and free developers from handling communication themselves. The drawback of queuing systems has been their insistence on queue semantics, in which the order of delivery to consumers matches the order of submission. This behavior is generally unnecessary in distributed systems and, when it is needed, is usually better handled by the client software.
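For comparison, a producer in a JMS-style queuing system such as ActiveMQ only needs to know the broker and a queue name; the broker handles delivering messages, in submission order, to whichever consumers attach. A sketch, with the broker URL and queue name assumed for illustration:

import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class QueueProducer {
    public static void main(String[] args) throws Exception {
        // The producer knows only the broker and the queue name, not the
        // consumers: components are loosely coupled through the message bus.
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616"); // assumed broker URL
        Connection connection = factory.createConnection();
        connection.start();

        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("events"); // assumed queue name
        MessageProducer producer = session.createProducer(queue);

        // The broker delivers these to consumers in submission order, the
        // queue semantics discussed above.
        producer.send(session.createTextMessage("event-1"));
        producer.send(session.createTextMessage("event-2"));

        connection.close();
    }
}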
The recognition that queuing semantics are mostly unneeded has led the latest entrants in the third generation of data-flow systems, Kafka and Flume, to largely abandon ordering semantics while still maintaining their distributed operation and reliability guarantees. This has allowed them to boost performance for nearly all applications. Kafka is also notable in that