Wide row support
Earlier, rows with multimillion columns were a problem for the Cassandra Hadoop integration: Hadoop pulled one row per call, limited by a SlicePredicate. From version 1.1 onwards, you can pass the wide-row Boolean parameter as true, as shown in the following snippet:
ConfigHelper.setInputColumnFamily(
    conf,
    keyspace,
    inCF,
    true // SET WIDEROW = TRUE
);
When wide row is set to true, the rows are fed to the Mapper one column at a time.
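To see what this means for a job, consider the Mapper below. This is a minimal sketch rather than an official example; it assumes the Cassandra 1.1-era Hadoop types (IColumn, ByteBufferUtil) and a hypothetical job that counts column names:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WideRowMapper
    extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                     Context context) throws IOException, InterruptedException {
    // With widerow = true, map() is called once per column, so the map
    // holds a single entry and a huge row never has to fit in memory.
    for (IColumn column : columns.values()) {
      context.write(new Text(ByteBufferUtil.string(column.name())), ONE);
    }
  }
}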
Bulk loading
The BulkOutputFormat class is another utility that Cassandra provides to improve the write performance of jobs that produce large amounts of data. It streams the data in a binary format, which is much quicker than inserting rows one by one; under the hood, it uses SSTableLoader to do this. Refer to SSTableLoader in Chapter 6, Managing a Cluster - Scaling, Node Repair, and Backup. Here's how to set it up:
Job job = new Job(conf);
job.setOutputFormatClass(BulkOutputFormat.class);
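BulkOutputFormat also needs the usual output settings on the job's Configuration before it can stream anything. The following is a minimal sketch, assuming the 1.1-era ConfigHelper; the keyspace, column family, node address, and partitioner here are placeholder values:

import org.apache.cassandra.hadoop.BulkOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Tell the output format where to stream the generated SSTables.
ConfigHelper.setOutputColumnFamily(conf, "myKeyspace", "myColumnFamily");
ConfigHelper.setOutputInitialAddress(conf, "10.0.0.1"); // any live node
ConfigHelper.setOutputPartitioner(conf,
    "org.apache.cassandra.dht.RandomPartitioner"); // match the cluster

Job job = new Job(conf);
job.setOutputFormatClass(BulkOutputFormat.class);

The reducer then emits (ByteBuffer key, List<Mutation> value) pairs, which BulkOutputFormat writes as SSTables and streams to the cluster in bulk.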
Secondary index support
Another improvement is that you can use a secondary index when pulling data from Cassandra to pass on to the job. It makes Cassandra filter the data and hand only the relevant rows to Hadoop, instead of Hadoop burning CPU cycles to weed out data that will not be used in the computation. This lowers the overhead of shipping extra data to Hadoop. Here is an example:
IndexExpression electronicItems =
    new IndexExpression(
        ByteBufferUtil.bytes("item_category"),
        IndexOperator.EQ,
        ByteBufferUtil.bytes("electronics")
    );