much time on this part of the environment except to describe the
mechanisms for writing directly to the data-flow component.
Data Flow
Collection, analysis, and reporting systems, with few exceptions, scale and
grow at different rates within an organization. For example, if incoming
traffic remains stable, but depth of analysis grows, then the analysis
infrastructure needs more resources despite the fact that the amount of data
collected stays the same. To allow for this, the infrastructure is separated
into tiers of collection, processing, and so on. Often, communication
between these tiers is ad hoc, with each application in the environment
using its own communication method to integrate with its other tiers.
One of the aims of a real-time architecture is to unify the environment,
at least to some extent, to allow for the more modular construction of
applications and their analysis. A key part of this is the data-flow system
(also called a data motion system in this topic).
These systems replace the ad hoc, application-specific communication
frameworks with a single, unified software system. The replacement software
systems are usually distributed systems, allowing them to expand and
handle complicated situations such as multi-datacenter deployment, but
they expose a common interface to both producers and consumers of the
data.
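The common interface such a system exposes can be sketched as follows. This is a minimal, in-memory illustration, not the API of any particular data-motion system (real ones, such as Kafka or Flume, are distributed); all names here are hypothetical:

```python
from collections import deque


class DataFlow:
    """Toy stand-in for a data-motion system.

    Producers and consumers both talk to named topics through one
    uniform interface, so the tiers behind a topic can scale
    independently of the applications feeding it.
    """

    def __init__(self):
        self.topics = {}

    def send(self, topic, message):
        # Producer side: append a message to a named topic.
        self.topics.setdefault(topic, deque()).append(message)

    def poll(self, topic):
        # Consumer side: take the oldest message, or None if empty.
        queue = self.topics.get(topic)
        return queue.popleft() if queue else None


flow = DataFlow()
flow.send("clickstream", {"user": 42, "page": "/home"})
event = flow.poll("clickstream")  # the producer's message, in order
```

The point is not the queue itself but the uniformity: every application in the environment produces and consumes through the same two calls, rather than through a bespoke integration per tier.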
The systems discussed in this topic are primarily what might be considered
third-generation systems. The “zero-th generation” systems are the closely
coupled ad hoc communication systems used to separate applications into
application-specific tiers.
The first-generation systems break this coupling, usually using some sort
of log-file system to collect application-specific data into files. These files
are then generically collected to a central processing location. Custom
processors then consume these files to implement the other tiers. This has
been, by far, the most popular system because it can be made reliable by
implementing “at least once” delivery semantics and because it's fast enough
for batch processing applications. The original Hadoop environments were
essentially optimized for this use case.
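The at-least-once property of the log-file approach comes from committing the read position only after records have been processed. A minimal sketch of that pattern, with hypothetical file names and a trivial stand-in for processing:

```python
import os

LOG = "events.log"          # application-specific log file (hypothetical name)
CHECKPOINT = "events.offset"  # last committed byte offset


def append_event(record):
    # First-generation pattern: the application appends records to a
    # local log file; a generic collector ships the file downstream.
    with open(LOG, "a") as f:
        f.write(record + "\n")


def consume_once():
    """Process records starting from the last committed offset.

    The offset is written only *after* processing succeeds, so a crash
    between processing and commit replays those records on restart:
    at-least-once delivery, at the cost of possible duplicates.
    """
    offset = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            offset = int(f.read())
    processed = []
    with open(LOG) as f:
        f.seek(offset)
        for line in f:
            processed.append(line.strip())  # "process" the record
        offset = f.tell()
    with open(CHECKPOINT, "w") as f:
        f.write(str(offset))  # commit only after processing
    return processed
```

Batch processors (including the early Hadoop workflows mentioned above) tolerate the duplicate records this can produce, which is part of why the design remained adequate for so long.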