Official Page: http://flume.apache.org
Hadoop Integration: Fully Integrated
You have identified data that lives in a feeder system and that you need in your Hadoop cluster
for analysis, and now you have to find a way to move it there. In general, you cannot use
FTP or SCP, because those tools transfer data between POSIX-compliant filesystems, and HDFS is not
POSIX compliant. Some Hadoop distributions, such as the MapR distribution or those certified
to use Isilon OneFS, can accommodate this. You could FTP the data to the
native filesystem on a Hadoop node and then use HDFS commands like copyFromLocal to load it, but
this is tedious and single-threaded. Flume to the rescue!
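For comparison, the manual route just described, with placeholder hostnames and paths, would look roughly like this for each file:

scp access.log hadoop-node:/tmp/access.log
ssh hadoop-node hdfs dfs -copyFromLocal /tmp/access.log /data/logs/access.log

Every file makes two serial hops, which is exactly the tedium Flume is designed to remove.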
Flume is a reliable, distributed system for collecting, aggregating, and moving large amounts
of log data from multiple sources into HDFS. It supports complex multihop flows as well as fan-in
and fan-out. Events are staged in a channel on each agent and delivered to the next agent in
the chain; they are removed only after they reach the next agent or the final sink, which is ultimately HDFS. A
Flume process has a configuration file that lists the sources, sinks, and channels for the data
flow. Typical use cases include loading log data into Hadoop.
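To picture a multihop flow in configuration terms, the sketch below (agent names, hostname, and port are illustrative, and each agent's own sources/sinks/channels declarations are omitted) forwards events from an upstream agent to a downstream collector over Avro:

# On the upstream agent: an Avro sink pointing at the collector host
upstream.sinks.avroSnk.type = avro
upstream.sinks.avroSnk.hostname = collector.example.com
upstream.sinks.avroSnk.port = 4141
upstream.sinks.avroSnk.channel = chn1

# On the downstream collector: an Avro source listening on that port
collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4141
collector.sources.avroSrc.channels = chn1

The collector then attaches its own channel and sink, so events hop from agent to agent until they reach the final sink.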
Tutorial Links
Dr. Dobb's Journal published an informative article on Flume. Readers who enjoy a lecture
should check out this interesting presentation from 2011.
Example Code
To use Flume, you'll first build a configuration file that describes the agent: the source, the
sink, and the channel. Here the source is netcat, which listens on a TCP port and turns each line
of text it receives into an event; the sink is an HDFS file; and the channel is a memory channel:
# xmpl.conf
# Name the components on this agent
agent1.sources = src1
agent1.sinks = snk1
agent1.channels = chn1

# Describe/configure the source: listen for lines of text on a TCP port
# (the bind address and port below are example values)
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
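A minimal sketch of the sink, channel, and binding sections, assuming an HDFS target path of flume/netcat and modest memory-channel sizing (adjust both for your cluster), could look like this:

# Describe the sink: write events out as plain files in HDFS
agent1.sinks.snk1.type = hdfs
agent1.sinks.snk1.hdfs.path = flume/netcat
agent1.sinks.snk1.hdfs.fileType = DataStream

# Use a channel that buffers events in memory
agent1.channels.chn1.type = memory
agent1.channels.chn1.capacity = 1000
agent1.channels.chn1.transactionCapacity = 100

# Bind the source and sink to the channel
agent1.sources.src1.channels = chn1
agent1.sinks.snk1.channel = chn1

You would then start the agent with the flume-ng launcher, naming the agent defined in the file:

flume-ng agent --conf conf --conf-file xmpl.conf --name agent1 -Dflume.root.logger=INFO,console

Pointing telnet (or nc) at localhost port 44444 and typing a few lines should then produce files under the configured HDFS path.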