The primary drawback of log-based systems is that they are slow: data is only collected in batches when a log file is “rolled” and then processed en masse. Second-generation data-flow systems recognized that reliable transport was not always a priority and began to use remote procedure call (RPC) mechanisms to move data between machines. Although
they may have some buffering to improve reliability, the second-generation
systems, such as Scribe and Avro, generally accept that speed is acquired
at the expense of reliability. For many applications this tradeoff is wholly
acceptable and has been made for decades in systems-monitoring software
such as Syslog, which uses a similar model.
Third-generation systems combine the reliability of the first-generation log
models with the speed of the second-generation RPC models. In these systems, both producers and consumers have a real-time interface to the data layer, which delivers data as discrete messages rather than the bulk delivery found in first-generation log systems. In practice,
this is usually accomplished as a “mini batch” on the order of a thousand
messages to improve performance.
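As a rough illustration of this message-oriented, micro-batched delivery, the following sketch uses Kafka’s Java producer (Kafka is discussed later in this section); the broker address, topic name, and batching values are illustrative assumptions rather than anything prescribed here:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MiniBatchProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Let the client accumulate a small batch of messages before sending,
        // trading a few milliseconds of latency for higher throughput.
        props.put("linger.ms", "5");
        props.put("batch.size", "16384");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // Each event is still sent as a discrete message; batching
                // happens transparently inside the client.
                producer.send(new ProducerRecord<>("events", Integer.toString(i), "event-" + i));
            }
        }
    }
}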
However, these environments also implement an intermediate storage layer that allows them to make the same “at least once” delivery guarantees as log-based delivery systems. To maintain the requisite performance, this storage layer is horizontally scaled across multiple physical machines, with coordination handled by the client software on both the producer and consumer sides of the system.
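On the consumer side, the “at least once” guarantee typically shows up as the client committing its position in the stream only after it has processed the messages it received; if it crashes before the commit, those messages are delivered again on restart. The following is a minimal sketch using Kafka’s Java consumer, with the broker address, topic, and group name assumed for illustration:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "analytics");               // assumed consumer group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        // Commit offsets manually, and only after processing succeeds.
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // application logic goes here
                }
                // If the process dies before this commit, the same records are
                // redelivered on restart: at-least-once semantics.
                consumer.commitSync();
            }
        }
    }

    private static void process(String value) {
        System.out.println(value);
    }
}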
The first of these third-generation systems were queuing systems designed to handle large data loads; ActiveMQ is an example. By providing a queuing paradigm, these systems allow for the development of message “buses” that loosely couple the different components of an architecture and free developers from handling communication themselves. The drawback of queuing systems has been their insistence on queue semantics, in which the order of delivery to consumers matches the order of submission. This behavior is generally unnecessary in distributed systems and, when it is needed, is usually better handled by the client software.
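For comparison, a producer in a JMS-style queuing system such as ActiveMQ only needs to know the broker and a queue name; the broker handles delivering messages, in submission order, to whichever consumers attach. A sketch, with the broker URL and queue name assumed for illustration:

import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class QueueProducer {
    public static void main(String[] args) throws Exception {
        // The producer knows only the broker and the queue name, not the
        // consumers: components are loosely coupled through the message bus.
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616"); // assumed broker URL
        Connection connection = factory.createConnection();
        connection.start();

        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("events"); // assumed queue name
        MessageProducer producer = session.createProducer(queue);

        // The broker delivers these to consumers in submission order, the
        // queue semantics discussed above.
        producer.send(session.createTextMessage("event-1"));
        producer.send(session.createTextMessage("event-2"));

        connection.close();
    }
}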
The recognition that queuing semantics are mostly unneeded has led the latest entrants in the third generation of data-flow systems, Kafka and Flume, to largely abandon ordering semantics while still maintaining their distributed operation and reliability guarantees. This has allowed them to boost performance for nearly all applications. Kafka is also notable in that