given the distribution of stationid values and how TextInputFormat makes
splits, the upload should be sufficiently distributed.
If a table is new, it will have only one region, and all updates will be to this single region
until it splits. This will happen even if row keys are randomly distributed. This startup
phenomenon means uploads run slowly at first, until there are sufficient regions distributed
so that all cluster members are able to participate in the uploads. Do not confuse this
phenomenon with the one noted in the previous paragraph.
Both of these problems can be avoided by using bulk loads, discussed next.
Bulk load
HBase has an efficient facility for bulk loading: a MapReduce job writes HBase's internal
data format directly into the filesystem. Going this route, it's possible to load an
HBase instance at rates that are an order of magnitude or more beyond those attainable by
writing via the HBase client API.
Bulk loading is a two-step process. The first step uses HFileOutputFormat2 to write
HFiles to an HDFS directory using a MapReduce job. Since rows have to be written in
order, the job must perform a total sort (see Total Sort) of the row keys. The
configureIncrementalLoad() method of HFileOutputFormat2 does all the necessary
configuration for you.
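The details of the driver vary with the HBase version, but a minimal sketch of this first
step might look like the following. The class and mapper names, the observations table
name, the data column family, and the row-key layout are illustrative assumptions rather
than the book's example code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFilePrepareJob {

  // Hypothetical mapper: parses "stationid<TAB>timestamp<TAB>airtemp"
  // lines and emits one Put per observation.
  static class ObservationMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      byte[] rowKey = Bytes.toBytes(fields[0] + fields[1]); // assumed row-key layout
      Put put = new Put(rowKey);
      put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("airtemp"),
          Bytes.toBytes(fields[2]));
      context.write(new ImmutableBytesWritable(rowKey), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "Write observation HFiles");
    job.setJarByClass(HFilePrepareJob.class);
    job.setMapperClass(ObservationMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw observation data
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile output directory

    TableName tableName = TableName.valueOf("observations"); // assumed table name
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName)) {
      // Sets the reducer, output format, and a total order partitioner
      // whose split points match the table's current region boundaries.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }
}

Because the map output value class is Put, configureIncrementalLoad() plugs in HBase's
sorting reducer alongside the total order partitioner, which is what gives the job its
total sort of the row keys.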
The second step of the bulk load involves moving the HFiles from HDFS into an existing
HBase table. The table can be live during this process. The example code includes a class
called HBaseTemperatureBulkImporter for loading the observation data using a
bulk load.
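Completing the load is a short client-side step. The sketch below assumes HBase 2.x,
where LoadIncrementalHFiles lives in the org.apache.hadoop.hbase.tool package (earlier
releases ship it as org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles), and it
reuses the assumed observations table name from the previous sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles;

public class CompleteBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("observations"); // assumed table name
    Path hfileDir = new Path(args[0]);                        // HFile directory from step one
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName);
         Admin admin = conn.getAdmin()) {
      // Moves (or splits, if region boundaries have changed) the HFiles
      // into the live table; the data is queryable once the call returns.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, admin, table, locator);
    }
  }
}

The same step can also be run from the command line with the completebulkload tool that
ships with HBase.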
Online Queries
To implement the online query application, we will use the HBase Java API directly. Here
it becomes clear how important your choice of schema and storage format is.
Station queries
The simplest query is to retrieve the static station information. This is a single-row
lookup, performed using a get() operation. This type of query is simple in a traditional
database, but HBase gives you additional control and flexibility. HBaseStationQuery treats
the info family as a key-value dictionary (column names as keys, column values as values)
and unpacks the result of a single get() into a map.
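A minimal sketch of that kind of lookup, assuming a stations table keyed by station ID and
the info column family described above (an illustration, not the book's exact
HBaseStationQuery listing):

import java.io.IOException;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StationLookup {
  private static final byte[] INFO = Bytes.toBytes("info");

  // Returns the info family of one station row as a map of
  // column qualifier -> cell value.
  public static Map<String, String> getStationInfo(Connection conn, String stationId)
      throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("stations"))) { // assumed table name
      Get get = new Get(Bytes.toBytes(stationId));
      get.addFamily(INFO);
      Result result = table.get(get);
      if (result.isEmpty()) {
        return Collections.emptyMap();
      }
      Map<String, String> info = new LinkedHashMap<>();
      NavigableMap<byte[], byte[]> familyMap = result.getFamilyMap(INFO);
      for (Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
        info.put(Bytes.toString(entry.getKey()), Bytes.toString(entry.getValue()));
      }
      return info;
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      System.out.println(getStationInfo(conn, args[0]));
    }
  }
}

Because all of a station's attributes live in one family of one row, a single round trip
to the region server returns the whole dictionary.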