Processing Streaming Data - Real-Time Analytics

Database Reference

In-Depth Information

Chapter 5

Processing Streaming Data

Now that data is flowing through a data collection system, it must be

processed. The original use case for both Kafka and Flume specified Hadoop

as the processing system. Hadoop is, of course, a batch system. Although it is

very good at what it does, it is hard to achieve processing rates with latencies

shorter than about 5 minutes.

The primary source of this limit on the rate of batch processing is startup

and shutdown cost. When a Hadoop job starts, a set of input splits is first

obtained from the input source (usually the Hadoop Distributed File System,

known as HDFS, but potentially other locations). Input splits are parceled

into separate mapper tasks by the Job Tracker, which may involve starting

new virtual machine instances on the worker nodes. Then there is the shuffle,

sort, and reduce phase.

Although each of these steps is fairly small, they add up. A typical job start

time requires somewhere between 10 and 30 seconds of “wall time,”

depending on the nature of the cluster. Hadoop 2 actually adds more time to

the total because it needs to spin up an Application Manager to manage the

job.Forabatchjobthatisgoingtorunfor30minutesoranhour,thisstartup

time is negligible and can be completely ignored for performance tuning. For

a job that is running every 5 minutes, 30 seconds of start time represents a 10

percent loss of performance.

Real-time processing frameworks, the subject of this chapter, get around this

setup and breakdown latency by using long-lived processes and very small

batches (potentially as small as a single record, but usually larger than that).

An individual job may run for hours, days, or weeks before being restarted,

which amortizes the startup costs to zero. The downside of this approach

is that there is some added management overhead to ensure the job runs

properly for a very long period of time and to handle the communication

between components of the job so that multiple machines can be used to

parallelize processing.

The chapter begins with a discussion of the issues affecting the development

ofstreaming dataframeworks. Theproblemofanalyzing streaming dataisan

old one, predating the rise of the current “Big Data” paradigms.

Search WWH ::

Custom Search

Home