Chapter 4
Data-Flow Management in Streaming Analysis
Chapter 3, “Service Configuration and Coordination,” introduces the concept
and difficulties of maintaining a distributed state. One of the most common
reasons to require this distributed state is the collection and processing of
data in a scalable way.
Distributed data flows, which include processing and collection, have been
around for a long time. Generally, the systems designed to handle this task have
been bespoke applications developed either in-house or through consulting
agreements. More recently, the technologies used to implement these data
flow systems have reached the point of common infrastructure. Data flow
systems can be split into a separate service in much the same way that
coordination and configuration can. They are now general enough in their
interfaces and their assumptions that they can be used outside of their
originally intended applications.
The earliest of these systems were arguably the queuing systems, such as
ActiveMQ, which started to come onto the scene in the early 2000s. However,
they were not really designed for high-throughput volumes (although many
of them can now achieve fairly good performance) and tended to be very
Java-centric.
The next systems on the scene were those open-sourced by the large Internet
companies such as Facebook. One of the best-known systems of this
generation was a tool called Scribe, which was released in 2008. It used an
RPC-like mechanism to concentrate data from edge servers into a processing
framework like Hadoop. Scribe has many of the same features as the current
generation, including the ability to spool data to disk, but it can only account
for intermittent connectivity failures.
Flume, developed by Cloudera, and Kafka are the current generation of
distributed data collection systems, and they represent two entirely separate
philosophies. This chapter discusses the care and feeding of both of these
data motion systems. In addition, there is some discussion of their
underlying philosophies to help make the decision about which system to use
in which situation. However, these two data motion systems should not be