given the distribution of stationid values and how TextInputFormat makes
splits, the upload should be sufficiently distributed.
If a table is new, it will have only one region, and all updates will be to this single region
until it splits. This will happen even if row keys are randomly distributed. This startup
phenomenon means uploads run slowly at first, until there are sufficient regions distributed
so that all cluster members are able to participate in the uploads. Do not confuse this
phenomenon with the one noted in the previous paragraph.
Both of these problems can be avoided by using bulk loads, discussed next.
Bulk load
HBase has an efficient facility for bulk loading: a MapReduce job writes HBase's internal
data format directly into the filesystem. Going this route, it's possible to load an
HBase instance at rates that are an order of magnitude or more beyond those attainable by
writing via the HBase client API.
Bulk loading is a two-step process. The first step uses HFileOutputFormat2 to write
HFiles to an HDFS directory using a MapReduce job. Since rows have to be written in
order, the job must perform a total sort (see Total Sort) of the row keys. The
configureIncrementalLoad() method of HFileOutputFormat2 does all the necessary
configuration for you.
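The details of the driver vary with the HBase version, but a minimal sketch of this first
step might look like the following. The class and mapper names, the observations table
name, the data column family, and the row-key layout are illustrative assumptions rather
than the book's example code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFilePrepareJob {

  // Hypothetical mapper: parses "stationid<TAB>timestamp<TAB>airtemp"
  // lines and emits one Put per observation.
  static class ObservationMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      byte[] rowKey = Bytes.toBytes(fields[0] + fields[1]); // assumed row-key layout
      Put put = new Put(rowKey);
      put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("airtemp"),
          Bytes.toBytes(fields[2]));
      context.write(new ImmutableBytesWritable(rowKey), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "Write observation HFiles");
    job.setJarByClass(HFilePrepareJob.class);
    job.setMapperClass(ObservationMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw observation data
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile output directory

    TableName tableName = TableName.valueOf("observations"); // assumed table name
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName)) {
      // Sets the reducer, output format, and a total order partitioner
      // whose split points match the table's current region boundaries.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }
}

Because the map output value class is Put, configureIncrementalLoad() plugs in HBase's
sorting reducer alongside the total order partitioner, which is what gives the job its
total sort of the row keys.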
The second step of the bulk load involves moving the HFiles from HDFS into an existing
HBase table. The table can be live during this process. The example code includes a class
called HBaseTemperatureBulkImporter for loading the observation data using a
bulk load.
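Completing the load is a short client-side step. The sketch below assumes HBase 2.x,
where LoadIncrementalHFiles lives in the org.apache.hadoop.hbase.tool package (earlier
releases ship it as org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles), and it
reuses the assumed observations table name from the previous sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles;

public class CompleteBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("observations"); // assumed table name
    Path hfileDir = new Path(args[0]);                        // HFile directory from step one
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName);
         Admin admin = conn.getAdmin()) {
      // Moves (or splits, if region boundaries have changed) the HFiles
      // into the live table; the data is queryable once the call returns.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, admin, table, locator);
    }
  }
}

The same step can also be run from the command line with the completebulkload tool that
ships with HBase.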
Online Queries
To implement the online query application, we will use the HBase Java API directly. Here
it becomes clear how important your choice of schema and storage format is.
Station queries
The simplest query is to retrieve the static station information. This is a single-row
lookup, performed using a get() operation. This type of query is simple in a traditional
database, but HBase gives you additional control and flexibility. HBaseStationQuery treats
the info family as a key-value dictionary (column names as keys, column values as values)
and unpacks the result of a single get() into a map.
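A minimal sketch of that kind of lookup, assuming a stations table keyed by station ID and
the info column family described above (an illustration, not the book's exact
HBaseStationQuery listing):

import java.io.IOException;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StationLookup {
  private static final byte[] INFO = Bytes.toBytes("info");

  // Returns the info family of one station row as a map of
  // column qualifier -> cell value.
  public static Map<String, String> getStationInfo(Connection conn, String stationId)
      throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("stations"))) { // assumed table name
      Get get = new Get(Bytes.toBytes(stationId));
      get.addFamily(INFO);
      Result result = table.get(get);
      if (result.isEmpty()) {
        return Collections.emptyMap();
      }
      Map<String, String> info = new LinkedHashMap<>();
      NavigableMap<byte[], byte[]> familyMap = result.getFamilyMap(INFO);
      for (Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
        info.put(Bytes.toString(entry.getKey()), Bytes.toString(entry.getValue()));
      }
      return info;
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      System.out.println(getStationInfo(conn, args[0]));
    }
  }
}

Because all of a station's attributes live in one family of one row, a single round trip
to the region server returns the whole dictionary.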