Integration with Hadoop - Mastering Apache Cassandra

ColumnFamilyInputFormat

The ColumnFamilyInputFormat class is an implementation of org.apache.hadoop.mapred.InputFormat (or its mapreduce counterpart in the newer API), so its behavior is dictated by the InputFormat specification. Hadoop uses this class to get data for the MapReduce tasks; it describes how to read data from column families into the Mapper instances.
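
As a rough illustration of the contract described above, the sketch below models the two duties of an input format: cutting the input into splits and turning one split into records for a mapper. It is a toy under stated assumptions, not the Hadoop or Cassandra API: ToyInputFormat, Split, getSplits, and readRecords are hypothetical names, and the "column family" is just an in-memory list of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for the InputFormat contract.
// This is NOT the real org.apache.hadoop API; all names are illustrative.
class ToyInputFormat {

    // A split is a half-open range [start, end) of row indexes.
    static final class Split {
        final int start;
        final int end;

        Split(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Duty 1: break the input into chunks, one per map task.
    static List<Split> getSplits(int totalRows, int rowsPerSplit) {
        if (totalRows < 0 || rowsPerSplit <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and rowsPerSplit > 0");
        }
        List<Split> splits = new ArrayList<>();
        // A long counter avoids int overflow when totalRows is very large.
        for (long start = 0; start < totalRows; start += rowsPerSplit) {
            int end = (int) Math.min(start + rowsPerSplit, (long) totalRows);
            splits.add(new Split((int) start, end));
        }
        return splits;
    }

    // Duty 2: turn one split into the records a mapper consumes.
    static List<String> readRecords(List<String> rows, Split split) {
        return new ArrayList<>(rows.subList(split.start, split.end));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("row0", "row1", "row2", "row3", "row4");
        for (Split s : getSplits(rows.size(), 2)) {
            System.out.println("[" + s.start + ", " + s.end + ") -> " + readRecords(rows, s));
        }
    }
}
```

In the real classes the splits describe ranges of the Cassandra ring rather than list indexes, and records arrive through a record reader, but the division of labor is the same.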

The other job of ColumnFamilyInputFormat (or of any InputFormat implementation) is to fragment the input data into small chunks that are fed to the map tasks. Cassandra has ColumnFamilySplit for this purpose. You can configure the number of rows per InputSplit via ConfigHelper.setInputSplitSize. There is a caveat, however: the data for each InputSplit is fetched with multiple get_range_slices queries, so, as the Cassandra documentation says, too small a value builds up call overhead, while too large a value may cause out-of-memory issues. Larger values are generally better for performance, so if you plan to tune this parameter, estimate a safe value from the median column size to avoid running out of memory, then refine it by trial and error. The default split size is 64 x 1024 (65,536) rows.
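
The kind of calculation suggested above can be sketched as follows. This is a back-of-the-envelope heuristic, not something taken from Cassandra: SplitSizing, splitCount, and maxRowsForBudget are hypothetical names, the memory budget is an assumed figure you would pick yourself, and the estimate pessimistically assumes that a whole split's rows are held in memory at once.

```java
// Hypothetical helper for reasoning about split sizes; not part of Cassandra.
class SplitSizing {

    // Default quoted in the text: 64 x 1024 rows per split.
    static final int DEFAULT_SPLIT_SIZE = 64 * 1024;

    // Number of splits (and therefore map tasks): ceil(totalRows / rowsPerSplit).
    static long splitCount(long totalRows, int rowsPerSplit) {
        if (totalRows < 0 || rowsPerSplit <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and rowsPerSplit > 0");
        }
        return (totalRows + rowsPerSplit - 1) / rowsPerSplit;
    }

    // Assumed heuristic: rows that fit in a memory budget, given the number of
    // columns per row and the median column size in bytes.
    static int maxRowsForBudget(long memoryBudgetBytes, int columnsPerRow, int medianColumnBytes) {
        if (memoryBudgetBytes < 0 || columnsPerRow <= 0 || medianColumnBytes <= 0) {
            throw new IllegalArgumentException("budget must be >= 0, sizes must be > 0");
        }
        long bytesPerRow = (long) columnsPerRow * medianColumnBytes;
        return (int) Math.min(memoryBudgetBytes / bytesPerRow, (long) Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        long totalRows = 1_000_000L;
        System.out.println("splits at default size: " + splitCount(totalRows, DEFAULT_SPLIT_SIZE));

        // Assumed inputs: 64 MiB budget, 10 columns per row, 128-byte median column.
        int rows = maxRowsForBudget(64L * 1024 * 1024, 10, 128);
        System.out.println("rows per split within budget: " + rows);
        System.out.println("splits at that size: " + splitCount(totalRows, rows));
    }
}
```

A value obtained this way is only a starting point for the trial and error mentioned above; it would then be passed to ConfigHelper.setInputSplitSize on the job configuration.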
