Distribution: Agent Tiers
How do we scale a set of Flume agents? If there is one agent running on every node producing raw data, then with the setup described so far, at any particular time each file being written to HDFS will consist entirely of the events from one node. It would be better if we could aggregate the events from a group of nodes in a single file, since this would result in fewer, larger files (with the concomitant reduction in pressure on HDFS, and more efficient processing in MapReduce; see Small files and CombineFileInputFormat). Also, if needed, files can be rolled more often since they are being fed by a larger number of nodes, leading to a reduction in the time between when an event is created and when it is available for analysis.
Aggregating Flume events is achieved by having tiers of Flume agents. The first tier col-
lects events from the original sources (such as web servers) and sends them to a smaller set
of agents in the second tier, which aggregate events from the first tier before writing them
to HDFS (see Figure 14-3 ). Further tiers may be warranted for very large numbers of
source nodes.
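As a sketch of this two-tier layout, the following Flume configuration connects a first-tier agent to a second-tier aggregator using an Avro sink/source pair, which is Flume's standard mechanism for linking tiers. The agent names (agent1, agent2), the collector hostname (collector1), the port, and the file paths are illustrative assumptions, not values from the text:

```properties
# First-tier agent: collects raw events on a source node and
# forwards them over Avro to the second tier.
# (Agent names, hostnames, ports, and paths are illustrative.)
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /var/log/app.log
agent1.sources.source1.channels = channel1

agent1.sinks.sink1.type = avro
agent1.sinks.sink1.hostname = collector1
agent1.sinks.sink1.port = 10000
agent1.sinks.sink1.channel = channel1

agent1.channels.channel1.type = file

# Second-tier agent: receives Avro events from many first-tier
# agents and writes the aggregated stream to HDFS.
agent2.sources = source2
agent2.sinks = sink2
agent2.channels = channel2

agent2.sources.source2.type = avro
agent2.sources.source2.bind = 0.0.0.0
agent2.sources.source2.port = 10000
agent2.sources.source2.channels = channel2

agent2.sinks.sink2.type = hdfs
agent2.sinks.sink2.hdfs.path = /flume/events
agent2.sinks.sink2.hdfs.fileType = DataStream
agent2.sinks.sink2.channel = channel2

agent2.channels.channel2.type = file
```

Because many first-tier agents point their Avro sinks at the same second-tier Avro source, each HDFS file written by agent2 contains events from the whole group of source nodes rather than from a single one.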