        String batch = context.getConfiguration().get("batch", "none");
        return batch;
    }

    @Override
    public String generatePartitionedPath(JobContext context,
            String topic, int brokerId, int partitionId,
            String encodedPartition) {
        return topic + "/" + encodedPartition;
    }
}
This Partitioner is then set in the camus.properties file to override
the default behavior:
etl.partitioner.class=wiley.streaming.camus.BatchPartitioner
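To put this in context, a minimal camus.properties might look something
like the following sketch. The HDFS paths, broker list, topic name, and the
batch value read by the partitioner above are placeholders, and some
property names vary between Camus releases:
# HDFS output and Camus bookkeeping locations (placeholder paths)
etl.destination.path=/user/camus/topics
etl.execution.base.path=/user/camus/exec
etl.execution.history.path=/user/camus/exec/history
# Kafka connection and topic selection (placeholder values)
kafka.brokers=broker1:9092,broker2:9092
kafka.whitelist.topics=events
# The custom partitioner and the batch label it reads from the configuration
etl.partitioner.class=wiley.streaming.camus.BatchPartitioner
batch=nightly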
Using Hadoop for ETL processes
Once data is being routinely ingested into Hadoop for storage, it can also
be integrated into ETL processes. Because the focus here is primarily on
real-time analysis, ETL in Hadoop is discussed only briefly.
There are a number of options for ETL pipelines in Hadoop. Some of the
more popular options are Pig and Hive. The former is a scripting language
for Hadoop that was developed by Yahoo! for ETL processing. The latter is
a SQL-like interface to Hadoop's Map-Reduce framework, which makes it a
popular choice for database developers.
These tools, among others, are typically used in a multistep pipeline to
produce a number of aggregated outputs. These are then made available
for other pipelines or processing tools. Again, Hive is a popular choice
here because it can be integrated with outside query tools. Commercial
vendors also offer specialized tools built to work directly with Hadoop.
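As an illustration of that kind of integration, the following sketch uses
the standard HiveServer2 JDBC driver from a Java client to run a simple
aggregation. The host name, database, and events table are assumptions made
for the example:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveAggregateExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (needed with older JDBC setups).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // The host, port, database, and events table below are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hadoop-master:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT to_date(event_time) AS day, COUNT(*) AS events "
                 + "FROM events GROUP BY to_date(event_time)")) {
            while (rs.next()) {
                // Print one row per day with its event count.
                System.out.println(rs.getString("day") + "\t" + rs.getLong("events"));
            }
        }
    }
}
A query along these lines would typically produce one of the aggregated
outputs described above, ready to be consumed by a reporting tool or loaded
into another store.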
The data can also be loaded from Hadoop into another database
environment. Many database vendors provide connectors from Hadoop to
their database to simplify this process. Another, more generic, option is
Sqoop. This is an Apache project that is used for bulk transfers between
Hadoop and other data stores. The package consists of a server that
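As a rough sketch of that export path, a Sqoop command along the following
lines pushes the contents of an HDFS directory into a relational table; the
JDBC URL, credentials, table, and directory are placeholders:
sqoop export \
  --connect jdbc:mysql://dbhost/analytics \
  --username etl --password secret \
  --table daily_counts \
  --export-dir /user/camus/topics/events \
  --num-mappers 4
The sqoop import command reverses the direction, pulling a relational table
into HDFS for processing.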