Data-Flow Management in Streaming Analysis - Real-Time Analytics

If the number of brokers changes, partitions and their replicas can be

reassigned to other brokers in the cluster. When consuming from a topic, the

consumer application, or consumer group if it is a distributed application,

will assign a single thread or process to each partition. These independent

threads then process each partition at their own pace, much like the Map

phase of a Map-Reduce application. Kafka's high-level consumer

implementation tracks the consumption of the various threads, allowing

processing to be restarted if an individual process or thread is interrupted.

While this model can be circumvented to improve consumption using

Kafka's low-level interfaces, the preferred mechanism is to increase the

number of partitions in a topic as necessary. To accomplish this task, Kafka

provides tools to add partitions to existing topics in a live environment.
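
The one-thread-per-partition model described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not use any Kafka client library, partitions are plain in-memory lists, and the names consume_topic, handle, and committed are hypothetical. Each worker records the next offset to read so that an interrupted run can be resumed.

```python
import threading


def consume_topic(partitions, handle, committed=None):
    """Process each partition in its own thread, resuming from committed offsets.

    partitions: dict mapping partition id -> list of messages (stand-in for a topic)
    handle:     callback invoked as handle(partition_id, offset, message)
    committed:  dict mapping partition id -> next offset to read (optional)
    """
    committed = dict(committed or {})
    lock = threading.Lock()

    def worker(pid, messages):
        # Each thread works through its own partition at its own pace.
        start = committed.get(pid, 0)
        for offset in range(start, len(messages)):
            handle(pid, offset, messages[offset])
            with lock:
                committed[pid] = offset + 1  # next offset to read after a restart

    threads = [
        threading.Thread(target=worker, args=(pid, msgs))
        for pid, msgs in partitions.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return committed
```

Passing the returned offsets back in as committed on a later call skips messages that were already handled, which is roughly the bookkeeping the high-level consumer performs on the application's behalf.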

Log-Structured Storage

Kafka is structured around an append-only log mechanism similar to the

write-ahead-log protocol found in database applications. The

write-ahead-log—which was apparently first developed by Ron Obermark

in 1974 while he was working on IBM's System R—essentially says that

before an object can be mutated, the log of its mutation must first have been

committed to a recovery log. This forms an “undo” log of message mutations

that can be applied or removed from the database to reconstruct the state of

the database at any particular moment in time.
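
As a rough illustration of the write-ahead rule, the sketch below records every mutation in a recovery log before applying it, and then uses that log either to rebuild an earlier state or to undo the latest change. The class LoggedStore and its methods are hypothetical names chosen for this example, and the in-memory list stands in for a durable log.

```python
_MISSING = object()  # marks "key did not exist before this mutation"


class LoggedStore:
    def __init__(self):
        self.log = []   # recovery log of (key, old_value, new_value) entries
        self.data = {}  # the mutable state protected by the log

    def put(self, key, value):
        # Write-ahead rule: record the mutation first, then mutate.
        self.log.append((key, self.data.get(key, _MISSING), value))
        self.data[key] = value

    def state_at(self, n):
        """Reconstruct the state as it was after the first n logged mutations."""
        state = {}
        for key, _old, new in self.log[:n]:
            state[key] = new
        return state

    def undo_last(self):
        """Remove the most recent mutation by restoring the logged old value."""
        key, old, _new = self.log.pop()
        if old is _MISSING:
            del self.data[key]
        else:
            self.data[key] = old
```

For example, after put("x", 1), put("x", 2), and put("y", 3), calling state_at(1) yields the state in which only x = 1 had been written.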

To accomplish this, the recovery log is structured such that every

change is assigned a unique and increasing number (what mathematicians

would call a strictly monotonically increasing sequence) and appended to

the end of a theoretically infinite file. Often, this increasing number is the

position of the record within the file because it is usually easy to obtain this

information when appending to a file. Of course, in a practical system, no

file can be infinitely large, so after a file reaches its maximum size, a new file

is created and the offsets are reset. If the files are also named using a strictly

monotonically increasing sequence, the semantics of the write-ahead-log are

completely maintained.
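
The segment-and-offset scheme in this paragraph can be sketched as follows. This is a minimal in-memory model, not Kafka's actual file format: the class SegmentedLog, the 4-byte length prefix, and the max_segment_bytes parameter are assumptions made for illustration. A record's position is the pair (segment number, byte offset within the segment), and because segments are numbered in increasing order, those pairs stay strictly increasing even though the offset resets in each new segment.

```python
class SegmentedLog:
    def __init__(self, max_segment_bytes):
        self.max_segment_bytes = max_segment_bytes
        self.segments = [bytearray()]  # list index doubles as the segment number

    def append(self, record):
        """Append a bytes record; return its (segment_number, offset) position."""
        framed = len(record).to_bytes(4, "big") + record  # length-prefixed record
        current = self.segments[-1]
        # Roll over to a new segment when the current one would exceed its limit.
        if current and len(current) + len(framed) > self.max_segment_bytes:
            current = bytearray()
            self.segments.append(current)
        offset = len(current)  # position in the segment, known for free on append
        current += framed
        return (len(self.segments) - 1, offset)

    def read(self, position):
        """Return the record stored at a position returned by append()."""
        segment_number, offset = position
        data = self.segments[segment_number]
        size = int.from_bytes(data[offset:offset + 4], "big")
        return bytes(data[offset + 4:offset + 4 + size])
```

Comparing positions as tuples gives the overall ordering of records: the segment number decides first, and the offset breaks ties within a segment.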

This approach is not only simple, but it also maximizes the performance

of the storage media most often used to maintain these logs. Back when

they were first being developed, write-ahead-logs would have been written

to tape storage media (each tape holding approximately 50MB of data).

Tape, of course, works best with sequential writes. Even modern spinning
