Official Page: http://flume.apache.org
Hadoop Integration: Fully Integrated
You have identified data that lives in a feeder system and that you need in your Hadoop cluster
for analysis, and now you have to find a way to move it there. In general, you cannot use
FTP or SCP, because those tools transfer data between POSIX-compliant filesystems, and HDFS is not
POSIX compliant. Some Hadoop distributions, such as the MapR distribution or those certified
to use Isilon OneFS, can accommodate this. You could FTP the data to the
native filesystem on a Hadoop node and then use HDFS commands like copyFromLocal to load it, but
this is tedious and single-threaded. Flume to the rescue!
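For comparison, the manual route just described, with placeholder hostnames and paths, would look roughly like this for each file:

scp access.log hadoop-node:/tmp/access.log
ssh hadoop-node hdfs dfs -copyFromLocal /tmp/access.log /data/logs/access.log

Every file makes two serial hops, which is exactly the tedium Flume is designed to remove.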
Flume is a reliable, distributed system for collecting, aggregating, and moving large amounts
of log data from multiple sources into HDFS. It supports complex multihop flows as well as fan-in
and fan-out. Events are staged in a channel on each agent and delivered to the next agent in
the chain; they are removed only after they reach the next agent or the final sink, which is ultimately HDFS. A
Flume process has a configuration file that lists the sources, sinks, and channels for the data
flow. Typical use cases include loading log data into Hadoop.
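To picture a multihop flow in configuration terms, the sketch below (agent names, hostname, and port are illustrative, and each agent's own sources/sinks/channels declarations are omitted) forwards events from an upstream agent to a downstream collector over Avro:

# On the upstream agent: an Avro sink pointing at the collector host
upstream.sinks.avroSnk.type = avro
upstream.sinks.avroSnk.hostname = collector.example.com
upstream.sinks.avroSnk.port = 4141
upstream.sinks.avroSnk.channel = chn1

# On the downstream collector: an Avro source listening on that port
collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4141
collector.sources.avroSrc.channels = chn1

The collector then attaches its own channel and sink, so events hop from agent to agent until they reach the final sink.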
Tutorial Links
Dr. Dobb's Journal published an informative article on Flume. Readers who enjoy a lecture
should check out this interesting presentation from 2011.
Example Code
To use Flume, you'll first build a configuration file that describes the agent: the source, the
sink, and the channel. Here the source is netcat, which listens on a TCP port and turns each line
of text it receives into an event; the sink is an HDFS file; and the channel is a memory channel:
# xmpl.conf
# Name the components on this agent
agent1.sources = src1
agent1.sinks = snk1
agent1.channels = chn1

# Describe/configure the source: listen for lines of text on a TCP port
# (the bind address and port below are example values)
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
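A minimal sketch of the sink, channel, and binding sections, assuming an HDFS target path of flume/netcat and modest memory-channel sizing (adjust both for your cluster), could look like this:

# Describe the sink: write events out as plain files in HDFS
agent1.sinks.snk1.type = hdfs
agent1.sinks.snk1.hdfs.path = flume/netcat
agent1.sinks.snk1.hdfs.fileType = DataStream

# Use a channel that buffers events in memory
agent1.channels.chn1.type = memory
agent1.channels.chn1.capacity = 1000
agent1.channels.chn1.transactionCapacity = 100

# Bind the source and sink to the channel
agent1.sources.src1.channels = chn1
agent1.sinks.snk1.channel = chn1

You would then start the agent with the flume-ng launcher, naming the agent defined in the file:

flume-ng agent --conf conf --conf-file xmpl.conf --name agent1 -Dflume.root.logger=INFO,console

Pointing telnet (or nc) at localhost port 44444 and typing a few lines should then produce files under the configured HDFS path.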