prefix. So, a typical temporary filename would be _events.1399295780136.log.tmp; the number is a timestamp generated by the HDFS sink.
A file is kept open by the HDFS sink until it has either been open for a given time (default 30 seconds, controlled by the hdfs.rollInterval property), has reached a given size (default 1,024 bytes, set by hdfs.rollSize), or has had a given number of events written to it (default 10, set by hdfs.rollCount). If any of these criteria are met, the file is closed and its in-use prefix and suffix are removed. New events are written to a new file (which will in turn have an in-use prefix and suffix until it is rolled).
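As a sketch, the three rolling criteria could be set explicitly in the example configuration (the values shown here are the defaults; setting a property to 0 disables that criterion):

```
agent1.sinks.sink1.hdfs.rollInterval = 30
agent1.sinks.sink1.hdfs.rollSize = 1024
agent1.sinks.sink1.hdfs.rollCount = 10
```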
After 30 seconds, we can be sure that the file has been rolled and we can take a look at its
contents:
% hadoop fs -cat /tmp/flume/events.1399295780136.log
Hello
Again
The HDFS sink writes files as the user who is running the Flume agent, unless the hdfs.proxyUser property is set, in which case files will be written as that user.
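For example, to write files as a different user, the property could be set as follows (the username flumewriter is hypothetical, and the Flume agent's own user must be authorized to impersonate it):

```
agent1.sinks.sink1.hdfs.proxyUser = flumewriter
```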
Partitioning and Interceptors
Large datasets are often organized into partitions, so that processing can be restricted to particular partitions if only a subset of the data is being queried. For Flume event data, it's very common to partition by time. A process can be run periodically that transforms completed partitions (to remove duplicate events, for example).
It's easy to change the example to store data in partitions by setting hdfs.path to include subdirectories that use time format escape sequences:
agent1.sinks.sink1.hdfs.path = /tmp/flume/year=%Y/month=%m/day=%d
Here we have chosen to have day-sized partitions, but other levels of granularity are possible, as are other directory layout schemes. (If you are using Hive, see Partitions and Buckets for how Hive lays out partitions on disk.) The full list of format escape sequences is provided in the documentation for the HDFS sink in the Flume User Guide.
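To illustrate, with the path above an event whose timestamp header falls in early May 2014 would be written to a directory such as the following (the exact date is for illustration):

```
/tmp/flume/year=2014/month=05/day=05/events.1399295780136.log
```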
The partition that a Flume event is written to is determined by the timestamp header on the event. Events don't have this header by default, but it can be added using a Flume interceptor. Interceptors are components that can modify or drop events in the flow; they are attached to sources, and are run on events before the events have been placed in a channel.[92] The following extra configuration lines add a timestamp interceptor to source1, which adds a timestamp header to every event produced by the source:
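A minimal sketch of such a configuration, using Flume's built-in timestamp interceptor type (the interceptor name interceptor1 is an arbitrary choice):

```
agent1.sources.source1.interceptors = interceptor1
agent1.sources.source1.interceptors.interceptor1.type = timestamp
```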