Hadoop Ecosystem Integration - Apache Sqoop


Discussion

HBase does not allow the insertion of empty values: each cell needs to have at least one byte. Sqoop serialization, however, skips all columns that contain a NULL value, so a row that contains NULL in every column is skipped entirely. This explains why Sqoop imports fewer rows than are available in your source table. The property sqoop.hbase.add.row.key instructs Sqoop to insert the row key column twice, once as a row identifier and then again in the data itself. Even if all other columns contain NULL, at least the column used for the row key won't be NULL, which allows the row to be inserted into HBase.
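As a minimal sketch of how the property might be passed on the command line, the import below sets sqoop.hbase.add.row.key to true with the generic -D option, which has to come directly after the tool name. The JDBC URL, the username, and the source table cities with column family world are placeholder assumptions, not values taken from this recipe. The script prints the command first and runs it only where a sqoop client is installed.

```shell
# Placeholder connection details (assumptions); replace with your own.
CONNECT="jdbc:mysql://mysql.example.com/sqoop"

# Collect the arguments first so the command can be reviewed before it runs.
# The generic -D option must directly follow the tool name (import).
set -- import \
  -Dsqoop.hbase.add.row.key=true \
  --connect "$CONNECT" \
  --username sqoop \
  -P \
  --table cities \
  --hbase-table cities \
  --column-family world

IMPORT_CMD="sqoop $*"
echo "$IMPORT_CMD"

# Run the import only where the sqoop client is installed.
# -P makes Sqoop prompt for the database password.
if command -v sqoop >/dev/null 2>&1; then
  sqoop "$@"
fi
```

With the property set, a source row whose other columns are all NULL should still produce one cell holding the row key, so it is no longer dropped.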

6.11. Improving Performance When Importing into HBase

Problem

Imports into HBase take significantly more time than importing as text files in HDFS.

Solution

Create your HBase table prior to running the Sqoop import, and instruct HBase to create more regions with the parameter NUMREGIONS. For example, you can create the HBase table cities with the column family world and 20 regions using the following command:

hbase> create 'cities', 'world', {NUMREGIONS => 20, SPLITALGO => 'HexStringSplit'}

Discussion

By default, every new HBase table has only one region, which can be served by only one Region Server. This means that every new table is initially served by only one physical node. Sqoop imports your data into HBase in parallel, but the parallel tasks bottleneck when they all insert into a single region. Eventually the region splits as it fills, allowing Sqoop to write to two servers, which does not help significantly. Over time, enough region splits occur to spread the load across your entire HBase cluster, but by then it is too late: your Sqoop import has already taken a significant performance hit. Our recommendation is to create the HBase table, prior to running the Sqoop import, with enough regions to spread the load across your entire HBase cluster.
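The pre-split step can also be scripted so that it always runs before the import. The following is a sketch only: the region count of 20 mirrors the example in the Solution, the variable names are made up for illustration, and it assumes an HBase version whose shell accepts the -n (non-interactive) option. The statement is printed for review and is sent to HBase only where the hbase client is installed.

```shell
# Assumed value for illustration; size NUM_REGIONS to your cluster.
NUM_REGIONS=20

# Same create statement as in the Solution, with the region count as a variable.
CREATE_STMT="create 'cities', 'world', {NUMREGIONS => ${NUM_REGIONS}, SPLITALGO => 'HexStringSplit'}"

# Print the statement so it can be reviewed first.
echo "$CREATE_STMT"

# Pre-split the table only where the hbase client is installed.
# -n runs the HBase shell non-interactively and reads commands from stdin.
if command -v hbase >/dev/null 2>&1; then
  echo "$CREATE_STMT" | hbase shell -n
fi
```

Once the table exists with its regions, the Sqoop import can target it with --hbase-table cities, and the parallel tasks have more than one region to write to from the start.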
