Data-Flow Management in Streaming Analysis - Real-Time Analytics

The addition of replication to Kafka also introduced some changes to the
Kafka Producer API. In versions of Kafka prior to 0.8 there was no
acknowledgement in the Producer API. Applications wrote to the Kafka
socket and hoped for the best. In Kafka 0.8, there are now three different
levels of acknowledgements available: none, leader, and all.

The first option, none, behaves as in Kafka 0.7 and earlier: no response is
returned to the producer. This is the least durable option and allows data
to be lost, but it affords maximum performance, which can easily reach tens
of thousands of messages per second.

The second option, leader, sends an acknowledgement after the leader has
received the message but before it has received acknowledgements from the
in-sync replicas (ISR). This reduces performance somewhat and can still lead
to data loss if the leader fails before the message is replicated, but this
option offers a reasonable level of durability for most applications.

The final option, all, sends the acknowledgement only after the leader has
committed the message, that is, after every replica in the ISR has received
it. In this situation, the data is not lost so long as at least one replica
remains in the ISR. However, the performance reduction relative to the none
case is significant, though much of this can be recovered with a large
number of partitions and a highly concurrent Producer implementation.
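
As a rough illustration of the three levels, the sketch below maps them onto the Kafka 0.8 producer property request.required.acks, which takes the values 0, 1, and -1. The helper names producer_config and to_properties and the broker addresses are hypothetical and exist only for this example. The code builds a configuration and does not talk to a broker.

```python
# Acknowledgement levels from the text mapped to the value of the
# Kafka 0.8 producer property `request.required.acks`.
ACK_LEVELS = {
    "none": "0",     # no response is returned to the producer
    "leader": "1",   # the leader acknowledges once it has the message
    "all": "-1",     # acknowledge only after the message is committed
}


def producer_config(brokers, ack_level):
    """Build a dict of Kafka 0.8 producer properties for one ack level.

    Hypothetical helper for illustration; not part of any Kafka client.
    """
    if ack_level not in ACK_LEVELS:
        raise ValueError("unknown acknowledgement level: %r" % (ack_level,))
    return {
        "metadata.broker.list": ",".join(brokers),
        "request.required.acks": ACK_LEVELS[ack_level],
    }


def to_properties(config):
    """Render the dict in Java .properties syntax, one key=value per line."""
    return "\n".join("%s=%s" % (k, v) for k, v in sorted(config.items()))


if __name__ == "__main__":
    # Hypothetical broker addresses.
    print(to_properties(producer_config(["kafka1:9092", "kafka2:9092"], "all")))
```

Keeping the level names in one table makes the durability choice explicit at the call site instead of leaving a bare 0, 1, or -1 in a properties file.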

Multiple Datacenter Deployments

Many web applications are latency sensitive, requiring them to be
geo-distributed around the globe. The connections between these far-flung
datacenters are, unsurprisingly, less reliable than connections within a
datacenter. Kafka helps to deal with potential (and depressingly common)
increased latency and complete connection loss between datacenters by
providing built-in mirroring tools. Using these tools, a Kafka cluster is
established in each datacenter with a retention time chosen to balance the
need to cover an extended outage against the available space in the remote
datacenter. If enough space is available, a longer retention time can be used
as a guard against disaster.
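
The trade-off between outage coverage and disk space can be estimated with simple arithmetic. The following is a back-of-the-envelope sketch, not a sizing tool: the function names, the headroom factor, and the example figures are all assumptions, and it ignores compression and index overhead. The result is the sort of number one might then set as the broker's log.retention.hours.

```python
def max_retention_hours(disk_bytes, ingest_bytes_per_sec,
                        replication_factor=1, headroom=0.8):
    """Estimate how many hours of data fit on the remote cluster's disks.

    `headroom` is the assumed fraction of the disk that may be filled.
    Each message is stored `replication_factor` times.
    """
    usable_bytes = disk_bytes * headroom
    bytes_per_hour = ingest_bytes_per_sec * 3600 * replication_factor
    return usable_bytes / bytes_per_hour


def covers_outage(retention_hours, outage_hours):
    """True if retained data would survive an outage of the given length."""
    return retention_hours >= outage_hours


if __name__ == "__main__":
    # Hypothetical figures: 2 TB of disk, 5 MB/s of incoming data, 2 replicas.
    hours = max_retention_hours(2 * 10**12, 5 * 10**6, replication_factor=2)
    print("retention: %.1f hours" % hours)
    print("covers a 24 hour outage:", covers_outage(hours, 24))
```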

These remote clusters are then copied into the main processing cluster using
a tool called MirrorMaker. This tool, which is shipped with Kafka, can read
from multiple remote clusters and write their messages into a single
output cluster. Writes to the same topic in each cluster are merged into a
single topic on the output cluster.
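
The merging behavior described above can be pictured with a toy model. This is only a sketch of the idea, not MirrorMaker's implementation: each cluster is modelled as an in-memory dict from topic name to a list of messages, and the function name mirror and the cluster names are hypothetical. The real tool consumes and produces continuously, so messages from different clusters interleave rather than being appended one cluster at a time.

```python
from collections import defaultdict


def mirror(remote_clusters):
    """Merge topics from several remote clusters into one output cluster.

    Each cluster is modelled as a dict mapping topic name -> list of messages.
    The input clusters are left unmodified.
    """
    output = defaultdict(list)
    for cluster in remote_clusters:
        for topic, messages in cluster.items():
            # Same-named topics from different clusters land in one topic.
            output[topic].extend(messages)
    return dict(output)


# Two hypothetical remote clusters that both write to the "clicks" topic.
us_east = {"clicks": ["e1", "e2"], "logins": ["l1"]}
eu_west = {"clicks": ["w1"]}
merged = mirror([us_east, eu_west])

if __name__ == "__main__":
    for topic in sorted(merged):
        print(topic, merged[topic])
```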
