        String batch = context.getConfiguration().get("batch", "none");
        return batch;
    }

    @Override
    public String generatePartitionedPath(JobContext context,
            String topic, int brokerId, int partitionId,
            String encodedPartition) {
        return topic + "/" + encodedPartition;
    }
}
This Partitioner is then set in the camus.properties file to override
the default behavior:
etl.partitioner.class=wiley.streaming.camus.BatchPartitioner
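To put this in context, a minimal camus.properties might look something
like the following sketch. The HDFS paths, broker list, topic name, and the
batch value read by the partitioner above are placeholders, and some
property names vary between Camus releases:
# HDFS output and Camus bookkeeping locations (placeholder paths)
etl.destination.path=/user/camus/topics
etl.execution.base.path=/user/camus/exec
etl.execution.history.path=/user/camus/exec/history
# Kafka connection and topic selection (placeholder values)
kafka.brokers=broker1:9092,broker2:9092
kafka.whitelist.topics=events
# The custom partitioner and the batch label it reads from the configuration
etl.partitioner.class=wiley.streaming.camus.BatchPartitioner
batch=nightly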
Using Hadoop for ETL processes
Once data is being routinely ingested into Hadoop for storage, it can also
be integrated into ETL processes. Because the focus here is primarily on
real-time analysis, ETL in Hadoop is discussed only briefly.
There are a number of options for ETL pipelines in Hadoop. Some of the
more popular options are Pig and Hive. The former is a scripting language
for Hadoop that was developed by Yahoo! for ETL processing. The latter is
a SQL-like interface to Hadoop's Map-Reduce framework, which makes it a
popular choice for database developers.
These tools, among others, are typically used in a multistep pipeline to
produce a number of aggregated outputs. These are then made available
for other pipelines or processing tools. Again, Hive is a popular choice
here because it can be integrated with outside query tools. Commercial
vendors also offer specialized tools built to work directly with Hadoop.
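As an illustration of that kind of integration, the following sketch uses
the standard HiveServer2 JDBC driver from a Java client to run a simple
aggregation. The host name, database, and events table are assumptions made
for the example:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveAggregateExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (needed with older JDBC setups).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // The host, port, database, and events table below are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hadoop-master:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT to_date(event_time) AS day, COUNT(*) AS events "
                 + "FROM events GROUP BY to_date(event_time)")) {
            while (rs.next()) {
                // Print one row per day with its event count.
                System.out.println(rs.getString("day") + "\t" + rs.getLong("events"));
            }
        }
    }
}
A query along these lines would typically produce one of the aggregated
outputs described above, ready to be consumed by a reporting tool or loaded
into another store.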
The data can also be loaded from Hadoop into another database
environment. Many database vendors provide connectors from Hadoop to
their database to simplify this process. Another, more generic, option is
Sqoop. This is an Apache project that is used for bulk transfers between
Hadoop and other data stores. The package consists of a server that
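As a rough sketch of that export path, a Sqoop command along the following
lines pushes the contents of an HDFS directory into a relational table; the
JDBC URL, credentials, table, and directory are placeholders:
sqoop export \
  --connect jdbc:mysql://dbhost/analytics \
  --username etl --password secret \
  --table daily_counts \
  --export-dir /user/camus/topics/events \
  --num-mappers 4
The sqoop import command reverses the direction, pulling a relational table
into HDFS for processing.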