agent99.kitchen.hdfs.roundValue = 5
agent99.kitchen.hdfs.roundUnit = minute
agent99.kitchen.hdfs.useLocalTimeStamp = false
In this example, the Hadoop Distributed File System (HDFS) sink is
instructed to create paths using a timestamp rounded down to the
nearest 5-minute interval. Because useLocalTimeStamp is false, the
timestamp is obtained from the event itself by looking for a timestamp
header rather than from the local clock.
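The rounded timestamp is substituted into the time escape sequences of the
sink's path. The path itself is not part of the snippet above, so the
following lines are only a hedged sketch of how such a sink is typically
configured, reusing the same agent99.kitchen prefix:
# Illustrative path layout; the escape sequences are filled in from the
# (rounded) timestamp, and rounding is enabled explicitly.
agent99.kitchen.hdfs.round = true
agent99.kitchen.hdfs.path = /ingest/events/%Y/%m/%d/%H%M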
Event versus Processing Time
It is tempting to use the event timestamp during this sort of ingest process
to neatly place events into directories that make later analysis easier.
Unfortunately, doing this makes continuous processing pipelines vastly more
difficult to implement. In a processing environment, data may arrive out of
order, which means that every directory an ingestion pass touched has to be
tracked. Keeping the data organized by import time instead makes it
immediately clear when the data was imported.
The reason it is important to keep track of the import time is that this is
what identifies the data that has to be fixed when a bug is introduced. If
the processing pipeline breaks, the import time pinpoints the data that has
to be reprocessed, not the time when the data was generated. The time of the
event does not matter for most recovery and maintenance situations, despite
being slightly easier to work with for users looking at historical data.
To use the local timestamp in Flume, simply change the
useLocalTimeStamp parameter in the configuration to true.
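For example, extending the configuration shown earlier:
agent99.kitchen.hdfs.useLocalTimeStamp = true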
Doing the same for Camus is a bit more complicated. The easiest way to
do it is to choose an output directory based on the current time before the
job starts. This directory can then be passed in as a configuration parameter
(the property name used below is illustrative) and read by a custom
Partitioner class:
public class BatchPartitioner implements Partitioner {
    String batch = null;
    @Override
    public String encodePartition(JobContext context, IEtlKey etlKey) {
        if (batch == null)
            // Read the batch directory chosen by the driver before the job
            // started; the property name here is illustrative.
            batch = context.getConfiguration().get("etl.batch.id");
        return batch;
    }
    // Depending on the Camus version, Partitioner may also require
    // generatePartitionedPath to be implemented.
}
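The other half of this approach happens before the job is submitted: the
driver computes the batch directory from the current (processing) time and
hands it to the job through its configuration. A minimal driver-side sketch,
assuming the hypothetical class name BatchDriver, date format, and
etl.batch.id property read back by BatchPartitioner above:
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;

public class BatchDriver {
    public static void main(String[] args) {
        // Derive the batch directory from the wall-clock (processing) time
        // at launch; the date format and property name are illustrative.
        Configuration conf = new Configuration();
        String batchId = new SimpleDateFormat("yyyy/MM/dd/HHmm").format(new Date());
        conf.set("etl.batch.id", batchId);
        // The configuration is then used to launch the Camus job, where
        // BatchPartitioner reads etl.batch.id back when encoding partitions.
    }
}
Because every event in a run lands under the same batch directory, a failed
or buggy run can be repaired by reprocessing just that directory.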