ColumnFamilyInputFormat
The ColumnFamilyInputFormat class is an implementation of
org.apache.hadoop.mapred.InputFormat (or org.apache.hadoop.mapreduce.InputFormat in the
newer API), so its behavior is dictated by the InputFormat specification. Hadoop uses
this class to get data for MapReduce tasks; it describes how to read data from column
families into the Mapper instances.
The other job of ColumnFamilyInputFormat (or of any implementation of InputFormat) is to
fragment the input data into small chunks that get fed to map tasks. Cassandra has
ColumnFamilySplit for this purpose. One can configure the number of rows per InputSplit
via ConfigHelper.setInputSplitSize. However, there is a caveat. It uses multiple
get_range_slices queries to fetch the data for each InputSplit, so, as the Cassandra
documentation says, a smaller value will build up call overhead; on the other hand, too
large a value may cause out-of-memory issues. Larger values are better for performance,
so if you plan to tune this parameter, do some calculation based on the median column
size to avoid running out of memory. Trial and error can be handy. The default split
size is 64 x 1024 rows.
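
The calculation suggested above can be sketched roughly as follows. This is only a back-of-the-envelope heuristic, not anything Cassandra does internally: the class SplitSizeEstimator, the method rowsPerSplit, and the idea of a fixed per-split memory budget are all assumptions made for illustration, and the sample numbers are invented.

```java
public final class SplitSizeEstimator {

    /** Default split size stated in the text: 64 x 1024 rows. */
    public static final int DEFAULT_SPLIT_ROWS = 64 * 1024;

    private SplitSizeEstimator() {
    }

    /**
     * Hypothetical helper: estimates how many rows fit into a memory budget,
     * assuming every row holds columnsPerRow columns of the median size.
     */
    public static int rowsPerSplit(long memoryBudgetBytes,
                                   long medianColumnBytes,
                                   int columnsPerRow) {
        if (memoryBudgetBytes <= 0 || medianColumnBytes <= 0 || columnsPerRow <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        // Approximate row size; throws ArithmeticException on overflow.
        long bytesPerRow = Math.multiplyExact(medianColumnBytes, (long) columnsPerRow);
        long rows = memoryBudgetBytes / bytesPerRow;
        // Clamp to at least one row and to the int range.
        return (int) Math.max(1L, Math.min(rows, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        // Assumed figures: 64 MiB budget, 200-byte median column, 50 columns per row.
        int rows = rowsPerSplit(64L * 1024 * 1024, 200, 50);
        System.out.println("estimated rows per split: " + rows);
        System.out.println("default rows per split:   " + DEFAULT_SPLIT_ROWS);
    }
}
```

The estimate would then be passed to ConfigHelper.setInputSplitSize and refined by trial and error, since real rows rarely match the median exactly.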