Processing Streaming Data - Real-Time Analytics

Database Reference

In-Depth Information

Two frameworks for implementing streaming systems are discussed: Storm

and Samza. Both are complete processing frameworks and are covered in

depth in this chapter. In both cases, you can construct a server

infrastructure that allows for distributed processing as well as

well-structured application programing interfaces (APIs) that describe how

computations are moved around these servers.

Distributed Streaming Data Processing

Before delving into the specifics of implementing streaming processing

applications within a particular framework, it helps to spend some time

understanding how distributed stream processing works. Stream processing

is ideal for certain problems, but certain requirements can affect

performance to the point that it would actually be faster to use a batch

system. This is particularly true of processes requiring coordination among

units.

This section provides an introduction to the properties of stream processing

systems. At its heart, stream processing is a specialized form of parallel

computing, designed to handle the case where data may be processed only

one time. However, unlike the days of monolithic supercomputers, most of

these parallel-computing environments are implemented on an unreliable

networking layer that introduces a much higher error rate.

Coordination

The core of any streaming framework is some sort of coordination server.

This coordination server stores information about the topology of the

process—which components should get what data and what physical

address they occupy.

The coordination server also maintains a distributed lock server for

handlingsomeofthepartitioningtasks.Inparticular,alockserverisneeded

for the initial load of data into the distributed processing framework. Data

can often be read far faster than it can be processed, so even the internal

partitioning of the data motion system might need to use more than one

process to pull from a given partition. This requires that the clients

coordinate among themselves.

Both of the frameworks discussed in this chapter use ZooKeeper to handle

their coordination tasks. Depending on load, it is possible to reuse the

Search WWH ::

Custom Search

Home