Transactions and Reliability
Flume uses separate transactions to guarantee delivery from the source to the channel and from the channel to the sink. In the example in the previous section, the spooling directory source creates an event for each line in the file. The source will only mark the file as completed once the transactions encapsulating the delivery of the events to the channel have been successfully committed.
Similarly, a transaction is used for the delivery of the events from the channel to the sink. If
for some unlikely reason the events could not be logged, the transaction would be rolled
back and the events would remain in the channel for later redelivery.
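To make this concrete, the two-hop flow described here can be expressed as a Flume agent configuration in the usual properties format. This is a minimal sketch; the agent and component names (agent1, source1, sink1, channel1) and the spool directory path are illustrative assumptions, not details given in this section:

# Name this agent's components (names are illustrative)
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Spooling directory source: creates one event per line
# (the spoolDir path is an assumed example)
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir

# Logger sink, fed from the channel in a separate transaction
agent1.sinks.sink1.type = logger

# Durable file channel between source and sink
agent1.channels.channel1.type = file

# Wire the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1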
The channel we are using is a file channel, which has the property of being durable: once an event has been written to the channel, it will not be lost, even if the agent restarts. (Flume also provides a memory channel that does not have this property, since events are stored in memory. With this channel, events are lost if the agent restarts. Depending on the application, this might be acceptable. The trade-off is that the memory channel has higher throughput than the file channel.)
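Switching between the two channel types is a configuration change rather than a code change. The snippet below is a sketch using standard Flume channel properties; the directory paths and capacity value are illustrative assumptions:

# Durable file channel: events survive an agent restart
agent1.channels.channel1.type = file
agent1.channels.channel1.checkpointDir = /tmp/flume/checkpoint
agent1.channels.channel1.dataDirs = /tmp/flume/data

# Alternative memory channel: higher throughput, but events
# are lost if the agent restarts
#agent1.channels.channel1.type = memory
#agent1.channels.channel1.capacity = 10000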
The overall effect is that every event produced by the source will reach the sink. The major caveat here is that every event will reach the sink at least once; that is, duplicates are possible. Duplicates can be produced in sources or sinks: for example, after an agent restart, the spooling directory source will redeliver events for an uncompleted file, even if some or all of them had been committed to the channel before the restart. Also after a restart, the logger sink will re-log any event that was logged but not committed (which could happen if the agent was shut down between these two operations).
At-least-once semantics might seem like a limitation, but in practice it is an acceptable performance trade-off. The stronger semantics of exactly once require a two-phase commit protocol, which is expensive. This choice is what differentiates Flume (at-least-once semantics) as a high-volume parallel event ingest system from more traditional enterprise messaging systems (exactly-once semantics). With at-least-once semantics, duplicate events can be removed further down the processing pipeline. Usually this takes the form of an application-specific deduplication job written in MapReduce or Hive.
Batching
For efficiency, Flume tries to process events in batches for each transaction, where possible, rather than one by one. Batching helps file channel performance in particular, since every transaction results in a local disk write and fsync call.
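The batch size is typically tuned per component. For example, the spooling directory source has a batchSize property controlling how many events are delivered to the channel in one transaction; the value below is an illustrative assumption, not a recommendation:

# Deliver up to 1,000 events to the channel per transaction
# (larger batches amortize the file channel's fsync cost)
agent1.sources.source1.batchSize = 1000

Larger batches mean fewer fsync calls per event, at the cost of more rework if a transaction has to be rolled back.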