Database Reference
In-Depth Information
Two frameworks for implementing streaming systems are discussed: Storm
and Samza. Both are complete processing frameworks and are covered in
depth in this chapter. In both cases, you can construct a server
infrastructure that allows for distributed processing as well as
well-structured application programing interfaces (APIs) that describe how
computations are moved around these servers.
Distributed Streaming Data Processing
Before delving into the specifics of implementing streaming processing
applications within a particular framework, it helps to spend some time
understanding how distributed stream processing works. Stream processing
is ideal for certain problems, but certain requirements can affect
performance to the point that it would actually be faster to use a batch
system. This is particularly true of processes requiring coordination among
units.
This section provides an introduction to the properties of stream processing
systems. At its heart, stream processing is a specialized form of parallel
computing, designed to handle the case where data may be processed only
one time. However, unlike the days of monolithic supercomputers, most of
these parallel-computing environments are implemented on an unreliable
networking layer that introduces a much higher error rate.
Coordination
The core of any streaming framework is some sort of coordination server.
This coordination server stores information about the topology of the
process—which components should get what data and what physical
address they occupy.
The coordination server also maintains a distributed lock server for
handlingsomeofthepartitioningtasks.Inparticular,alockserverisneeded
for the initial load of data into the distributed processing framework. Data
can often be read far faster than it can be processed, so even the internal
partitioning of the data motion system might need to use more than one
process to pull from a given partition. This requires that the clients
coordinate among themselves.
Both of the frameworks discussed in this chapter use ZooKeeper to handle
their coordination tasks. Depending on load, it is possible to reuse the
Search WWH ::




Custom Search