KafkaStream<byte[], byte[]> stream =
    consumerMap.get(topic).get(0);
The KafkaStream object provides an iterator of class ConsumerIterator, which is used to retrieve data from the stream. Calling the next method on the iterator returns a MessageAndMetadata object that can be used to extract the key and the message:
ConsumerIterator<byte[], byte[]> iter = stream.iterator();
while (iter.hasNext()) {
    String message = new String(iter.next().message());
    System.out.println(message);
}
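As a minimal sketch of extracting the key as well as the message (assuming the same legacy Kafka 0.8 high-level consumer API and the stream variable from the previous snippet), the loop can be written against the MessageAndMetadata object directly; note that the key may be null if the producer did not set one:

import kafka.consumer.ConsumerIterator;
import kafka.message.MessageAndMetadata;

ConsumerIterator<byte[], byte[]> iter = stream.iterator();
while (iter.hasNext()) {
    // next() returns a MessageAndMetadata carrying both the key and the payload
    MessageAndMetadata<byte[], byte[]> record = iter.next();
    byte[] keyBytes = record.key(); // null when the producer sent no key
    String key = (keyBytes == null) ? "<no key>" : new String(keyBytes);
    String message = new String(record.message());
    System.out.println(key + " -> " + message);
}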
Apache Flume: Distributed Log Collection
Kafka was designed and built within a specific production environment. In
this case, its designers essentially had complete control over the available
software stack. In addition, they were replacing an existing component with
similar producer/consumer semantics (though different delivery
semantics), making Kafka relatively easy to integrate from an engineering
perspective.
In many production environments, developers are not so lucky. There are often a number of pre-existing legacy systems and data collection or logging paths. In addition, for a variety of reasons, it may not be possible to modify these systems' logging behavior. For example, the software could be produced by an outside vendor with its own goals and development roadmap.
To integrate with these environments, there is the Apache Flume project. Rather than providing a single opinionated mechanism for log collection, as Kafka does, Flume is a framework for integrating the data ingress and egress of a variety of data services. Originally, this meant primarily the collection of logs from sources such as web servers, with transport to HDFS in a Hadoop environment, but Flume has since expanded to cover a much larger set of use cases, sometimes blurring the line between the data motion discussed in this chapter and the processing systems discussed in Chapter 5.
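As a concrete illustration of that original use case, the following is a minimal sketch of a Flume agent configuration that tails a web server access log and delivers the events to HDFS. The agent name, file paths, and capacity shown here are hypothetical and would need to match a real deployment:

# Sketch of a Flume agent with one source, one channel, and one sink
agent1.sources = weblog
agent1.channels = memchan
agent1.sinks = hdfssink

# Source: tail a web server access log (exec source chosen for illustration)
agent1.sources.weblog.type = exec
agent1.sources.weblog.command = tail -F /var/log/httpd/access_log
agent1.sources.weblog.channels = memchan

# Channel: buffer events in memory between the source and the sink
agent1.channels.memchan.type = memory
agent1.channels.memchan.capacity = 10000

# Sink: write events to HDFS, bucketed by date
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.channel = memchan
agent1.sinks.hdfssink.hdfs.path = hdfs://namenode/flume/weblogs/%Y/%m/%d
agent1.sinks.hdfssink.hdfs.useLocalTimeStamp = true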