Apache Flume NG
Apache Flume NG (version 1.x) is a distributed framework that handles the collection and
aggregation of large amounts of log data. The project was built primarily for streaming
data, and Flume is robust, reliable, and fault tolerant. Although it was designed around
streaming log data, its flexibility with multiple data sources makes it easy to configure
for general event data; Flume can handle almost any kind of data.
Flume performs collection and aggregation using agents. An agent consists of a source, a
channel, and a sink.
Events, such as streamed log records, are fed to the source. There are different types of
Flume sources, each able to consume a different kind of data. After receiving the events, the
source stores the data in one or more channels. A channel is a queue that holds all the data
received from a source. The data is retained in the channel until it is consumed by the sink.
The sink is responsible for taking data from channels and placing it on an external store
such as HDFS.
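As a concrete illustration, an agent is wired together in Flume's properties-style configuration file. The following is a minimal sketch rather than a production setup; the agent name (a1), the component names (r1, c1, k1), the tailed log file, and the HDFS path are all hypothetical.

    # Name the components of the agent (hypothetical names)
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: tail a log file and turn each line into an event
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app/app.log
    a1.sources.r1.channels = c1

    # Channel: an in-memory queue that buffers events between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000
    a1.channels.c1.transactionCapacity = 100

    # Sink: drain the channel and write the events to HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.useLocalTimeStamp = true

Such an agent would be started with the flume-ng command, pointing it at this file with the --conf-file option and at the agent name with --name a1.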
The following diagram shows the flow of event/log data to HDFS via the agent:
In the preceding diagram, we see a simple data flow where events or logs are provided as
input to a Flume agent. The source, which is a subcomponent of the agent, forwards the
data to one or more channels. The sink then takes the data from the channel and pushes it
to HDFS. It is important to note that the source and sink of an agent work asynchronously:
the channel buffers the data, so the rate at which the source pushes data into the channel
and the rate at which the sink drains it can be configured independently to absorb spikes
in the event/log data.
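The amount of buffering available to absorb such spikes is set on the channel itself. The fragment below is a minimal sketch using a durable file channel; the agent and channel names (a1, c1), the directories, and the capacity values are illustrative only.

    # File channel: persists buffered events to disk so bursts survive restarts
    a1.channels = c1
    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /var/flume/checkpoint
    a1.channels.c1.dataDirs = /var/flume/data
    a1.channels.c1.capacity = 1000000
    a1.channels.c1.transactionCapacity = 1000

Here, capacity bounds how many events the channel may hold while the sink catches up, and transactionCapacity limits how many events are put or taken in a single transaction.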
Using Flume, you can configure more complex data flows in which the sink of one agent
feeds the source of another agent. Such flows are referred to as multi-hop flows.
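A multi-hop flow is typically built by pairing an Avro sink on the first agent with an Avro source on the second. The sketch below assumes two hypothetical agents (a1 and a2), a made-up hostname, and an arbitrary port; only the hop-related components are shown.

    # Agent a1: forwards events to the next hop over Avro
    a1.sinks = k1
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = aggregator.example.com
    a1.sinks.k1.port = 4141
    a1.sinks.k1.channel = c1

    # Agent a2 (on aggregator.example.com): receives events from the previous hop
    a2.sources = r1
    a2.sources.r1.type = avro
    a2.sources.r1.bind = 0.0.0.0
    a2.sources.r1.port = 4141
    a2.sources.r1.channels = c1

Each agent still needs its own channel and, on the final hop, a terminal sink such as the HDFS sink shown earlier.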