Chapter 5
Processing Streaming Data
Now that data is flowing through a data collection system, it must be
processed. The original use case for both Kafka and Flume specified Hadoop
as the processing system. Hadoop is, of course, a batch system. Although it is
very good at what it does, it is hard to achieve processing rates with latencies
shorter than about 5 minutes.
The primary source of this limit on the rate of batch processing is startup
and shutdown cost. When a Hadoop job starts, a set of input splits is first
obtained from the input source (usually the Hadoop Distributed File System,
known as HDFS, but potentially other locations). Input splits are parceled
into separate mapper tasks by the JobTracker, which may involve starting
new Java Virtual Machine (JVM) instances on the worker nodes. Then come the
shuffle, sort, and reduce phases.
Although each of these steps is fairly small, they add up. A typical job start
requires somewhere between 10 and 30 seconds of “wall time,”
depending on the nature of the cluster. Hadoop 2 actually adds more time to
the total because it needs to spin up an Application Master to manage the
job. For a batch job that is going to run for 30 minutes or an hour, this startup
time is negligible and can be completely ignored for performance tuning. For
a job that is running every 5 minutes, 30 seconds of start time represents a 10
percent loss of performance.
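The arithmetic behind these figures can be checked with a short sketch. The function name startup_overhead is hypothetical, chosen only for illustration, and the model assumes that the fixed startup time is the only overhead in each scheduling interval.

```python
def startup_overhead(startup_seconds: float, interval_seconds: float) -> float:
    """Return the fraction of each interval that is spent starting the job.

    Simplifying assumption: startup is the only fixed cost per run.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return startup_seconds / interval_seconds


# A job launched every 5 minutes with a 30-second start time.
five_minute_job = startup_overhead(30, 5 * 60)

# A job that runs for an hour with the same 30-second start time.
hourly_job = startup_overhead(30, 60 * 60)

print(f"5-minute job: {five_minute_job:.1%} of the interval is startup")
print(f"1-hour job:   {hourly_job:.1%} of the interval is startup")
```

Under this model the 5-minute job loses a tenth of its interval to startup, while the hour-long job loses less than 1 percent, which is why the cost only matters as the batch interval shrinks.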
Real-time processing frameworks, the subject of this chapter, get around this
setup and teardown latency by using long-lived processes and very small
batches (potentially as small as a single record, but usually larger than that).
An individual job may run for hours, days, or weeks before being restarted,
which amortizes the startup cost to nearly zero. The downside of this approach
is the added management overhead needed to ensure that the job runs
properly for a very long period of time and to handle the communication
between components of the job so that multiple machines can be used to
parallelize processing.
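The long-lived, small-batch pattern described above can be sketched in a few lines. This is an illustration only, not the design of any particular framework: LongLivedWorker, micro_batches, and run are hypothetical names, and the "expensive" setup is represented by a simple counter.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(records: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group an incoming stream of records into small batches."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class LongLivedWorker:
    """Hypothetical worker that pays its startup cost once."""

    def __init__(self) -> None:
        self.startups = 0
        self.processed = 0
        self._start()

    def _start(self) -> None:
        # Stand-in for expensive setup (process launch, connections, and so on).
        self.startups += 1

    def process(self, batch: List[int]) -> None:
        # Stand-in for real per-batch work.
        self.processed += len(batch)


def run(records: Iterable[int], batch_size: int) -> LongLivedWorker:
    """Start one worker and feed it every batch from the stream."""
    worker = LongLivedWorker()
    for batch in micro_batches(records, batch_size):
        worker.process(batch)
    return worker


worker = run(range(10), batch_size=3)
print(worker.startups, worker.processed)
```

However many batches arrive, the setup step runs once, so its cost per record shrinks as the job keeps running. The sketch leaves out the parts the text identifies as the hard ones: keeping the process healthy over long periods and coordinating work across machines.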
The chapter begins with a discussion of the issues affecting the development
of streaming data frameworks. The problem of analyzing streaming data is an
old one, predating the rise of the current “Big Data” paradigms.