agent99.kitchen.hdfs.roundValue = 5
agent99.kitchen.hdfs.roundUnit = minute
agent99.kitchen.hdfs.useLocalTimeStamp = false
In this example, the Hadoop Distributed File System (HDFS) sink is
instructed to create paths using a timestamp rounded down to the
nearest 5-minute interval. Because useLocalTimeStamp is false, the
timestamp is obtained from the event itself by looking for a timestamp
header rather than from the local clock.
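The rounded timestamp is substituted into the time escape sequences of the
sink's path. The path itself is not part of the snippet above, so the
following lines are only a hedged sketch of how such a sink is typically
configured, reusing the same agent99.kitchen prefix:
# Illustrative path layout; the escape sequences are filled in from the
# (rounded) timestamp, and rounding is enabled explicitly.
agent99.kitchen.hdfs.round = true
agent99.kitchen.hdfs.path = /ingest/events/%Y/%m/%d/%H%M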
Event versus Processing Time
It is tempting to use the event timestamp during this sort of ingest process
to neatly place events into directories that make later analysis easier.
Unfortunately, doing this makes continuous processing pipelines vastly more
difficult to implement. In a processing environment, data may arrive out of
order, which means that every directory an ingestion pass touched has to be
tracked. Keeping the data organized by import time instead makes it
immediately clear when the data was imported.
The reason it is important to keep track of the import time is that this is
what identifies the data that has to be fixed when a bug is introduced. If
the processing pipeline breaks, the import time pinpoints the data that has
to be reprocessed, not the time when the data was generated. The time of the
event does not matter for most recovery and maintenance situations, despite
being slightly easier to work with for users looking at historical data.
To use the local timestamp in Flume, simply change the
useLocalTimeStamp parameter in the configuration to true.
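For example, extending the configuration shown earlier:
agent99.kitchen.hdfs.useLocalTimeStamp = true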
Doing the same for Camus is a bit more complicated. The easiest way to
do it is to choose an output directory based on the current time before the
job starts. This directory can then be passed in as a configuration parameter
(the property name used below is illustrative) and read by a custom
Partitioner class:
public class BatchPartitioner implements Partitioner {
    String batch = null;
    @Override
    public String encodePartition(JobContext context, IEtlKey etlKey) {
        if (batch == null)
            // Read the batch directory chosen by the driver before the job
            // started; the property name here is illustrative.
            batch = context.getConfiguration().get("etl.batch.id");
        return batch;
    }
    // Depending on the Camus version, Partitioner may also require
    // generatePartitionedPath to be implemented.
}
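The other half of this approach happens before the job is submitted: the
driver computes the batch directory from the current (processing) time and
hands it to the job through its configuration. A minimal driver-side sketch,
assuming the hypothetical class name BatchDriver, date format, and
etl.batch.id property read back by BatchPartitioner above:
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;

public class BatchDriver {
    public static void main(String[] args) {
        // Derive the batch directory from the wall-clock (processing) time
        // at launch; the date format and property name are illustrative.
        Configuration conf = new Configuration();
        String batchId = new SimpleDateFormat("yyyy/MM/dd/HHmm").format(new Date());
        conf.set("etl.batch.id", batchId);
        // The configuration is then used to launch the Camus job, where
        // BatchPartitioner reads etl.batch.id back when encoding partitions.
    }
}
Because every event in a run lands under the same batch directory, a failed
or buggy run can be repaired by reprocessing just that directory.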