Apache Flume NG
Apache Flume NG (version 1.x) is a distributed framework that handles the collection and
aggregation of large amounts of log data. The project was built primarily for streaming
data, and Flume is robust, reliable, and fault tolerant. Although it was designed around
streaming log data, its flexibility with multiple data sources makes it easy to configure
for general event data; Flume can handle almost any kind of data.
Flume performs collection and aggregation using agents. An agent consists of a source, a
channel, and a sink.
Events, such as streamed log records, are fed to the source. There are different types of
Flume sources, each able to consume a different kind of data. After receiving the events, the
source stores the data in one or more channels. A channel is a queue that holds all the data
received from a source. The data is retained in the channel until it is consumed by the sink.
The sink is responsible for taking data from channels and placing it on an external store
such as HDFS.
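As a concrete illustration, an agent is wired together in Flume's properties-style configuration file. The following is a minimal sketch rather than a production setup; the agent name (a1), the component names (r1, c1, k1), the tailed log file, and the HDFS path are all hypothetical.

    # Name the components of the agent (hypothetical names)
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: tail a log file and turn each line into an event
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app/app.log
    a1.sources.r1.channels = c1

    # Channel: an in-memory queue that buffers events between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000
    a1.channels.c1.transactionCapacity = 100

    # Sink: drain the channel and write the events to HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.useLocalTimeStamp = true

Such an agent would be started with the flume-ng command, pointing it at this file with the --conf-file option and at the agent name with --name a1.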
The following diagram shows the flow of event/log data to HDFS via the agent:
In the preceding diagram, we see a simple data flow where events or logs are provided as
input to a Flume agent. The source, which is a subcomponent of the agent, forwards the
data to one or more channels. The sink then takes the data from the channel and pushes it
to HDFS. It is important to note that the source and sink of an agent work asynchronously:
the channel buffers the data, so the rate at which the source pushes data into the channel
and the rate at which the sink drains it can be configured independently to absorb spikes
in the event/log data.
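The amount of buffering available to absorb such spikes is set on the channel itself. The fragment below is a minimal sketch using a durable file channel; the agent and channel names (a1, c1), the directories, and the capacity values are illustrative only.

    # File channel: persists buffered events to disk so bursts survive restarts
    a1.channels = c1
    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /var/flume/checkpoint
    a1.channels.c1.dataDirs = /var/flume/data
    a1.channels.c1.capacity = 1000000
    a1.channels.c1.transactionCapacity = 1000

Here, capacity bounds how many events the channel may hold while the sink catches up, and transactionCapacity limits how many events are put or taken in a single transaction.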
Using Flume, you can configure more complex data flows in which the sink of one agent
feeds the source of another agent. Such flows are referred to as multi-hop flows.
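A multi-hop flow is typically built by pairing an Avro sink on the first agent with an Avro source on the second. The sketch below assumes two hypothetical agents (a1 and a2), a made-up hostname, and an arbitrary port; only the hop-related components are shown.

    # Agent a1: forwards events to the next hop over Avro
    a1.sinks = k1
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = aggregator.example.com
    a1.sinks.k1.port = 4141
    a1.sinks.k1.channel = c1

    # Agent a2 (on aggregator.example.com): receives events from the previous hop
    a2.sources = r1
    a2.sources.r1.type = avro
    a2.sources.r1.bind = 0.0.0.0
    a2.sources.r1.port = 4141
    a2.sources.r1.channels = c1

Each agent still needs its own channel and, on the final hop, a terminal sink such as the HDFS sink shown earlier.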