prefix. So, a typical temporary filename would be _events.1399295780136.log.tmp; the number is a timestamp generated by the HDFS sink.
A file is kept open by the HDFS sink until it has either been open for a given time (default 30 seconds, controlled by the hdfs.rollInterval property), has reached a given size (default 1,024 bytes, set by hdfs.rollSize), or has had a given number of events written to it (default 10, set by hdfs.rollCount). If any of these criteria are met, the file is closed and its in-use prefix and suffix are removed. New events are written to a new file (which will in turn have an in-use prefix and suffix until it is rolled).
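As a sketch, the three rolling criteria could be set explicitly in the example configuration (the values shown here are the defaults; setting a property to 0 disables that criterion):

```
agent1.sinks.sink1.hdfs.rollInterval = 30
agent1.sinks.sink1.hdfs.rollSize = 1024
agent1.sinks.sink1.hdfs.rollCount = 10
```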
After 30 seconds, we can be sure that the file has been rolled and we can take a look at its
contents:
% hadoop fs -cat /tmp/flume/events.1399295780136.log
Hello
Again
The HDFS sink writes files as the user who is running the Flume agent, unless the hdfs.proxyUser property is set, in which case files will be written as that user.
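For example, to write files as a different user, the property could be set as follows (the username flumewriter is hypothetical, and the Flume agent's own user must be authorized to impersonate it):

```
agent1.sinks.sink1.hdfs.proxyUser = flumewriter
```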
Partitioning and Interceptors
Large datasets are often organized into partitions, so that processing can be restricted to particular partitions if only a subset of the data is being queried. For Flume event data, it's very common to partition by time. A process can be run periodically that transforms completed partitions (to remove duplicate events, for example).
It's easy to change the example to store data in partitions by setting hdfs.path to include subdirectories that use time format escape sequences:
agent1.sinks.sink1.hdfs.path = /tmp/flume/year=%Y/month=%m/day=%d
Here we have chosen to have day-sized partitions, but other levels of granularity are possible, as are other directory layout schemes. (If you are using Hive, see Partitions and Buckets for how Hive lays out partitions on disk.) The full list of format escape sequences is provided in the documentation for the HDFS sink in the Flume User Guide.
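To illustrate, with the path above an event whose timestamp header falls in early May 2014 would be written to a directory such as the following (the exact date is for illustration):

```
/tmp/flume/year=2014/month=05/day=05/events.1399295780136.log
```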
The partition that a Flume event is written to is determined by the timestamp header on the event. Events don't have this header by default, but it can be added using a Flume interceptor. Interceptors are components that can modify or drop events in the flow; they are attached to sources, and are run on events before the events have been placed in a channel.[92] The following extra configuration lines add a timestamp interceptor to source1, which adds a timestamp header to every event produced by the source:
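A minimal sketch of such a configuration, using Flume's built-in timestamp interceptor type (the interceptor name interceptor1 is an arbitrary choice):

```
agent1.sources.source1.interceptors = interceptor1
agent1.sources.source1.interceptors.interceptor1.type = timestamp
```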