By default, Import loads data directly into HBase. To instead generate HFiles of data in preparation for a bulk load, pass the following option:
-Dimport.bulk.output=/path/for/output
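As an illustrative sketch, a full invocation might look like the following; the table name and HDFS paths here are placeholders, not values from the text:

```sh
# Generate HFiles under the given output path instead of writing
# Puts directly into the table ("mytable" and paths are hypothetical).
$ hbase org.apache.hadoop.hbase.mapreduce.Import \
    -Dimport.bulk.output=/user/hbase/bulk-output \
    mytable /user/hbase/export-input
```

The generated HFiles can then be handed to HBase's bulk-load tooling (for example, the completebulkload utility) to be moved into the table's regions.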
To apply a generic org.apache.hadoop.hbase.filter.Filter to the input, use the following options:
-Dimport.filter.class=<name of filter class>
-Dimport.filter.args=<comma separated list of args for filter>
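As a hedged sketch, an import restricted to rows matching a key prefix might be invoked as follows; the choice of PrefixFilter and the argument value are illustrative assumptions, and the filter class must have a constructor compatible with the arguments supplied:

```sh
# Only import rows whose key starts with the given prefix
# (filter choice and prefix value are placeholders).
$ hbase org.apache.hadoop.hbase.mapreduce.Import \
    -Dimport.filter.class=org.apache.hadoop.hbase.filter.PrefixFilter \
    -Dimport.filter.args=row-prefix \
    mytable /user/hbase/export-input
```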
The filter will be applied before renaming keys via the HBASE_IMPORTER_RENAME_CFS property. Furthermore, filters will only use the Filter#filterRowKey(byte[] buffer, int offset, int length) method to identify whether the current row needs to be ignored completely for processing, and the Filter#filterKeyValue(KeyValue) method to determine whether the KeyValue should be added; Filter.ReturnCode#INCLUDE and #INCLUDE_AND_NEXT_COL will be considered as including the KeyValue.
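Since the filter runs before any column-family renaming, the two can be combined in one invocation. The sketch below assumes the rename mapping is passed as a `-D` property in `oldcf:newcf` form; the family names and filter arguments are placeholders:

```sh
# Filter rows first, then rename column family "oldcf" to "newcf"
# while importing (mapping syntax and names are assumptions).
$ hbase org.apache.hadoop.hbase.mapreduce.Import \
    -Dimport.filter.class=org.apache.hadoop.hbase.filter.PrefixFilter \
    -Dimport.filter.args=row-prefix \
    -DHBASE_IMPORTER_RENAME_CFS=oldcf:newcf \
    mytable /user/hbase/export-input
```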
For performance, consider the following options:
-Dmapred.map.tasks.speculative.execution=false
-Dmapred.reduce.tasks.speculative.execution=false
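Putting the performance options together, a tuned import might be invoked as below; the rationale for disabling speculative execution is that duplicate task attempts could otherwise write the same rows more than once (table and path are placeholders):

```sh
# Disable speculative execution so duplicate map/reduce attempts
# do not re-write rows ("mytable" and the input path are hypothetical).
$ hbase org.apache.hadoop.hbase.mapreduce.Import \
    -Dmapred.map.tasks.speculative.execution=false \
    -Dmapred.reduce.tasks.speculative.execution=false \
    mytable /user/hbase/export-input
```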
Copy table
The CopyTable MapReduce job is used to scan through an HBase table and write directly to another table. During this process, no intermediate flat file is created; using this utility, each Put is performed directly against the sink table, which can be on the same cluster or on an entirely different cluster. Like the export job, we can also specify the start and end timestamps to ensure fine-grained control over the data. The CopyTable MapReduce job is invoked as follows:
$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable
Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>
Options:
 rs.class     hbase.regionserver.class of the peer cluster
              specify if different from current cluster
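For instance, copying a time window of data to a renamed table on a remote cluster might look like the following sketch; the ZooKeeper quorum, timestamps (epoch milliseconds), and table names are placeholders, and --peer.adr is given in the hbase.zookeeper.quorum:port:znode-parent form:

```sh
# Copy rows written between the two timestamps into "mytable_copy"
# on a peer cluster (hosts, times, and names are hypothetical).
$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
    --starttime=1388534400000 \
    --endtime=1391212800000 \
    --new.name=mytable_copy \
    --peer.adr=zk-host1,zk-host2,zk-host3:2181:/hbase \
    mytable
```

Omitting --peer.adr copies within the same cluster, in which case --new.name must differ from the source table name.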