As it happens, this sort of environment is not unique to LinkedIn. Many
companies that deal primarily with “Internet data” find themselves in the
same situation. Additionally, many of them are engineering focused,
meaning that most of their software is developed in-house rather than
licensed from a third party. This allows such companies to use the Kafka
model, and it is useful enough that a similar system, called Kinesis, was
recently announced by Amazon.com. This product aims to form a core
part of the integration between various Amazon.com services, such as its
key-value store DynamoDB, its object store S3, its Hadoop
infrastructure Elastic MapReduce, and its high-performance data
warehouse Redshift.
This section covers the design of Kafka's internals and how they integrate to
solve the problems mentioned here.
Topics, Partitions, and Brokers
The organizing element of Kafka is the “topic.” In the Kafka system, a topic
is a logical grouping of the data such that all data contained within the
topic should be somehow related. Most commonly, the messages in a topic are
related only in that they can be parsed by the same common mechanism, and
not much else.
A topic is further subdivided into a number of partitions. These partitions
are, effectively, the limit on the rate that an I/O-bound consumer can
retrieve data from Kafka. This is because clients often use a single consumer
thread (or process) per partition. For example, with Camus, a tool for
moving data from Kafka into the Hadoop Distributed File System (HDFS)
using Hadoop, a Mapper can pull from multiple partitions, but multiple
Mappers will not pull from the same partition.
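The parallelism limit described above can be sketched in plain Python. This is not a Kafka client; the in-memory queues, partition count, and thread-per-partition layout are all assumptions chosen to illustrate why throughput scales only up to the number of partitions:

```python
import queue
import threading

# Hypothetical topic with 3 partitions, each modeled as a queue of messages.
partitions = [queue.Queue() for _ in range(3)]
for p, q in enumerate(partitions):
    for i in range(5):
        q.put(f"partition-{p}-msg-{i}")

consumed = []
lock = threading.Lock()

def consume(q):
    # One consumer thread per partition: since no two consumers share a
    # partition, at most len(partitions) threads can make progress, which
    # is exactly the I/O parallelism ceiling the text describes.
    while not q.empty():
        msg = q.get()
        with lock:
            consumed.append(msg)

threads = [threading.Thread(target=consume, args=(q,)) for q in partitions]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(consumed))  # all 15 messages, drained by at most 3 parallel consumers
```

Adding a fourth consumer thread here would leave it idle; the same is true of a fourth Camus Mapper against a three-partition topic.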
Partitions are also used to logically organize a topic. Producer
implementations usually provide a mechanism to choose the Kafka partition
for a given message based on the key of that message.
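A common scheme for such key-based partition selection can be sketched as follows. The hash-modulo approach shown here is an assumption for illustration, not necessarily the exact default of any particular Kafka producer implementation:

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    # Hash the message key and take it modulo the partition count, so
    # every message with the same key lands on the same partition (and
    # per-key ordering is therefore preserved within that partition).
    return zlib.crc32(key) % num_partitions

# All messages for a given user id are routed to one partition.
p1 = choose_partition(b"user-42", 8)
p2 = choose_partition(b"user-42", 8)
assert p1 == p2
```

Because the mapping is deterministic, a consumer of a single partition sees every message for the keys that hash to it, in order.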
Partitions themselves are distributed among brokers, which are the physical
processes that make up a Kafka cluster. Typically, each broker in the cluster
corresponds to a separate physical server and manages all of the writes to
that server's disk. The partitions are then uniformly distributed across the
different brokers and, in Kafka 0.8 and later, replicas are distributed across
other brokers in the cluster.
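The distribution of partitions and their replicas across brokers can be sketched with a simple placement function. This round-robin-with-offset scheme is an illustration under assumed names, not Kafka's exact assignment algorithm:

```python
def assign(num_partitions: int, brokers: list, replication_factor: int) -> dict:
    # Spread partition leaders round-robin over the brokers, and place
    # each additional replica on the next broker in the ring, so that
    # no single broker holds two copies of the same partition.
    n = len(brokers)
    return {
        p: [brokers[(p + r) % n] for r in range(replication_factor)]
        for p in range(num_partitions)
    }

layout = assign(6, ["broker-0", "broker-1", "broker-2"], replication_factor=2)
for partition, replicas in layout.items():
    print(partition, replicas)
```

With six partitions, three brokers, and a replication factor of two, each broker ends up leading two partitions and holding two replicas of partitions led elsewhere, which is the uniform spread the text describes.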