The Export job takes the source table name and the output directory name as inputs. The number of versions, filters, and start and end timestamps can also be provided to the export job for fine-grained control. The start and end timestamps make it possible to run incremental exports from the tables. The data is written as Hadoop SequenceFiles in the specified output directory, keyed by rowkey and storing the persisted Result instances as values:
$ hbase org.apache.hadoop.hbase.mapreduce.Export
Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions>
[<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]
The -D properties will be applied to the conf used; for example:
-D mapred.output.compress=true
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.
GzipCodec
-D mapred.output.compression.type=BLOCK
Additionally, the following SCAN properties can be specified to control/limit what is exported:
-D hbase.mapreduce.scan.column.family=<familyName>
-D hbase.mapreduce.include.deleted.rows=true
For performance, consider the following properties:
-Dhbase.client.scanner.caching=100
-Dmapred.map.tasks.speculative.execution=false
-Dmapred.reduce.tasks.speculative.execution=false
For tables with very wide rows, consider setting the batch size as follows:
-Dhbase.export.scanner.batch=10
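Putting the options above together, an incremental export of the last 24 hours of writes might look like the following sketch. The table name sales, the output path /backup/sales_incremental, and the chosen property values are hypothetical; adjust them for your cluster:

```shell
#!/usr/bin/env bash
# Compute a 24-hour window in epoch milliseconds (HBase timestamps are ms).
END=$(date +%s%3N)                      # current time in ms
START=$((END - 24*60*60*1000))          # 24 hours earlier

# Export one version of each cell written in [START, END) from the
# (hypothetical) table "sales" to the (hypothetical) HDFS output directory.
hbase org.apache.hadoop.hbase.mapreduce.Export \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -D hbase.client.scanner.caching=100 \
  sales /backup/sales_incremental 1 "$START" "$END"
```

Note the positional argument order from the usage string: the version count (1 here) must precede the start and end timestamps.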
The Import job reads the records from the source sequence file, creating Put instances from the persisted Result instances. It then uses the HTable API to write these puts to the target table. The Import option does not provide filtering of the data while inserting into tables; for any additional data manipulation, a custom implementation needs to be provided by extending the Import class.
$ hbase org.apache.hadoop.hbase.mapreduce.Import
Usage: Import [options] <tablename> <inputdir>
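For completeness, a matching import run might look like the following sketch. The target table name sales_restored and the input path are hypothetical, and the target table must already exist with the same column families as the exported source:

```shell
#!/usr/bin/env bash
# Hypothetical target table and input directory; the table must be
# pre-created (Import does not create it for you).
TABLE=sales_restored
INPUT=/backup/sales_incremental

# Disable speculative execution so the same Put is not written twice
# by duplicate map attempts.
hbase org.apache.hadoop.hbase.mapreduce.Import \
  -D mapred.map.tasks.speculative.execution=false \
  "$TABLE" "$INPUT"
```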