Discussion
HBase does not allow the insertion of empty values: each cell needs to have at least one
byte. Sqoop serialization, however, skips all columns that contain a NULL value, so a row
whose non-key columns are all NULL produces no cells and is skipped entirely. This explains
why Sqoop imports fewer rows than are available in your source table. The property
sqoop.hbase.add.row.key instructs Sqoop to insert the row key column twice, once
as a row identifier and then again in the data itself. Even if all other columns contain
NULL, at least the column used for the row key won't be null, which allows the insertion
of the row into HBase.
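As a minimal sketch of how the property might be passed, the function below wraps a Sqoop import with sqoop.hbase.add.row.key set on the command line. The JDBC URL, the username, and the function name import_cities_with_row_key are illustrative assumptions; the table cities and column family world are borrowed from the example in the next recipe. Generic -D properties must come directly after the tool name, before the tool-specific arguments.

```shell
# Sketch only: the JDBC URL and username are placeholders, not values
# from a real deployment. The function name is hypothetical.
import_cities_with_row_key() {
  # -D must follow "import" directly so Sqoop treats it as a generic
  # Hadoop property rather than a tool argument.
  # -P prompts for the database password on the console.
  sqoop import \
    -Dsqoop.hbase.add.row.key=true \
    --connect "jdbc:mysql://mysql.example.com/sqoop" \
    --username sqoop \
    -P \
    --table cities \
    --hbase-table cities \
    --column-family world
}
```

Calling import_cities_with_row_key runs the import. With the property set, each row carries at least one non-empty cell (the copy of the row key), so rows whose remaining columns are all NULL should no longer be dropped.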
6.11. Improving Performance When Importing into HBase
Problem
Imports into HBase take significantly more time than importing as text files in HDFS.
Solution
Create your HBase table prior to running the Sqoop import, and instruct HBase to create
more regions with the parameter NUMREGIONS. For example, you can create the HBase
table cities with the column family world and 20 regions using the following
command:
hbase> create 'cities', 'world', {NUMREGIONS => 20, SPLITALGO => 'HexStringSplit'}
Discussion
By default, every new HBase table has only one region, which can be served by only one
Region Server. This means that every new table will be served by only one physical node.
Sqoop does parallel import of your data into HBase, but the parallel tasks will bottleneck
when inserting data into one single region. Eventually the region will split up as it fills,
allowing Sqoop to write to two servers, which does not help significantly. Over time,
enough region splitting will occur to help spread the load across your entire HBase
cluster. By then, however, it is too late: your Sqoop import has already taken a
significant performance hit. Our recommendation is to create the HBase table with a
sufficient number of regions before running the Sqoop import, so that the load is spread
across your entire HBase cluster from the start.
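The order of operations described above can be sketched as a single shell function: create the pre-split table first, then start the parallel import into it. The function name presplit_and_import_cities, the JDBC URL, the username, and the mapper count of 4 are assumptions chosen for illustration. The sketch also assumes that the hbase and sqoop commands are on the PATH and that the HBase shell accepts statements on standard input.

```shell
# Sketch only: connection details and the mapper count are placeholders.
# The function name is hypothetical.
presplit_and_import_cities() {
  # Step 1: create the table with 20 regions before any data arrives.
  echo "create 'cities', 'world', {NUMREGIONS => 20, SPLITALGO => 'HexStringSplit'}" \
    | hbase shell || return 1

  # Step 2: run the parallel import into the existing table.
  # --hbase-create-table is deliberately omitted: the table already exists.
  sqoop import \
    --connect "jdbc:mysql://mysql.example.com/sqoop" \
    --username sqoop \
    -P \
    --table cities \
    --hbase-table cities \
    --column-family world \
    --num-mappers 4
}
```

Because the table already exists with its regions in place, the parallel map tasks are not all forced through a single region when the import starts. How evenly the writes spread still depends on how your row keys fall across the split points.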