Wide row support
Earlier, rows with multimillion columns were a problem for the Cassandra Hadoop integration: data was pulled one row per call, limited by the SlicePredicate. From version 1.1 onwards, you can pass the wide-row Boolean parameter as true, as shown in the following snippet:
ConfigHelper.setInputColumnFamily(
    conf,
    keyspace,
    inCF,
    true // SET WIDEROW = TRUE
);
When wide row is set to true, the rows are fed one column at a time to the Mapper.
Bulk loading
The BulkOutputFormat class is another utility that Cassandra provides to improve the write performance of jobs that produce large amounts of data. It streams the data in a binary format, which is much quicker than inserting records one by one. Under the hood, it uses SSTableLoader, the same mechanism used for node repair and backup. Here's how to set it up:
Job job = new Job(conf);
job.setOutputFormatClass(BulkOutputFormat.class);
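A fuller job setup typically also tells Cassandra where and how to stream the SSTables. A sketch of such a configuration, using the 1.1-era Hadoop API; the keyspace, column family, address, and partitioner values here are placeholders, not from the text:

```java
// Sketch of a complete BulkOutputFormat job configuration.
// "Keyspace1", "Standard1", the node address, and the partitioner
// class are illustrative values; substitute your cluster's settings.
Job job = new Job(conf);
job.setOutputFormatClass(BulkOutputFormat.class);

// Reducers emit (row key, mutation list) pairs.
job.setOutputKeyClass(ByteBuffer.class);
job.setOutputValueClass(List.class);

ConfigHelper.setOutputColumnFamily(conf, "Keyspace1", "Standard1");
ConfigHelper.setOutputInitialAddress(conf, "127.0.0.1"); // any live node
ConfigHelper.setOutputPartitioner(conf,
    "org.apache.cassandra.dht.RandomPartitioner");
```

The partitioner must match the one configured on the target cluster, since the SSTables are built locally and then streamed to the nodes that own each range.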
Secondary index support
One can use a secondary index when pulling data from Cassandra into a job. This is another improvement: it makes Cassandra sift the data and pass only the relevant rows to Hadoop, instead of Hadoop burning CPU cycles to weed out data that will not be used in the computation. This lowers the overhead of shipping extra data to Hadoop. Here is an example:
IndexExpression electronicItems =
    new IndexExpression(
        ByteBufferUtil.bytes("item_category"),
        IndexOperator.EQ,
        ByteBufferUtil.bytes("electronics")
    );
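To make the expression take effect, it must be handed to the job configuration; in the 1.1-era API this is done through ConfigHelper, roughly as sketched below (a minimal sketch, assuming conf is the job's Configuration and electronicItems is the expression built above):

```java
// Register the index expression(s) with the job so that Cassandra
// applies the filter server-side before data reaches the Mapper.
ConfigHelper.setInputRange(conf, Arrays.asList(electronicItems));
```

Note that at least one expression must use the EQ operator on an indexed column for Cassandra to be able to serve the query.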