Wide row support
Earlier, rows with multimillion columns were a problem for the Cassandra Hadoop integration: data was pulled one row per call, limited by the SlicePredicate. From version 1.1 onwards, you can pass the wide-row Boolean parameter as true, as shown in the following snippet:
ConfigHelper.setInputColumnFamily(
    conf,
    keyspace,
    inCF,
    true // SET WIDEROW = TRUE
);
When wide row is set to true, the rows are fed one column at a time to the Mapper.
Bulk loading
The BulkOutputFormat class is another utility that Cassandra provides to improve the write performance of jobs that produce large amounts of data. It streams the data in a binary format, which is much quicker than inserting records one by one. Under the hood, it uses SSTableLoader, the same mechanism used for node repair and backup. Here's how to set it up:
Job job = new Job(conf);
job.setOutputFormatClass(BulkOutputFormat.class);
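A fuller job setup typically also tells Cassandra where and how to stream the SSTables. A sketch of such a configuration, using the 1.1-era Hadoop API; the keyspace, column family, address, and partitioner values here are placeholders, not from the text:

```java
// Sketch of a complete BulkOutputFormat job configuration.
// "Keyspace1", "Standard1", the node address, and the partitioner
// class are illustrative values; substitute your cluster's settings.
Job job = new Job(conf);
job.setOutputFormatClass(BulkOutputFormat.class);

// Reducers emit (row key, mutation list) pairs.
job.setOutputKeyClass(ByteBuffer.class);
job.setOutputValueClass(List.class);

ConfigHelper.setOutputColumnFamily(conf, "Keyspace1", "Standard1");
ConfigHelper.setOutputInitialAddress(conf, "127.0.0.1"); // any live node
ConfigHelper.setOutputPartitioner(conf,
    "org.apache.cassandra.dht.RandomPartitioner");
```

The partitioner must match the one configured on the target cluster, since the SSTables are built locally and then streamed to the nodes that own each range.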
Secondary index support
One can use a secondary index when pulling data from Cassandra into a job. This is another improvement: it makes Cassandra sift the data and pass only the relevant rows to Hadoop, instead of Hadoop burning CPU cycles to weed out data that will not be used in the computation. This lowers the overhead of shipping extra data to Hadoop. Here is an example:
IndexExpression electronicItems =
    new IndexExpression(
        ByteBufferUtil.bytes("item_category"),
        IndexOperator.EQ,
        ByteBufferUtil.bytes("electronics")
    );
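To make the expression take effect, it must be handed to the job configuration; in the 1.1-era API this is done through ConfigHelper, roughly as sketched below (a minimal sketch, assuming conf is the job's Configuration and electronicItems is the expression built above):

```java
// Register the index expression(s) with the job so that Cassandra
// applies the filter server-side before data reaches the Mapper.
ConfigHelper.setInputRange(conf, Arrays.asList(electronicItems));
```

Note that at least one expression must use the EQ operator on an indexed column for Cassandra to be able to serve the query.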