HBase - Hadoop: The Definitive Guide

Our choice of schema is derived from knowing the most efficient way we can read from
HBase. Rows and columns are stored in increasing lexicographical order. Though there
are facilities for secondary indexing and regular expression matching, they come at a
performance penalty. It is vital that you understand the most efficient way to query your
data in order to choose the most effective setup for storing and accessing it.

For the stations table, the choice of stationid as the key is obvious because we
will always access information for a particular station by its ID. The observations
table, however, uses a composite key that adds the observation timestamp at the end. This
will group all observations for a particular station together, and by using a reverse-order
timestamp (Long.MAX_VALUE - timestamp) and storing it as binary, observations
for each station will be ordered with the most recent observation first.
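
To make the composite key concrete, here is a minimal sketch in plain Java that builds such a key and checks its ordering. It is an illustration under stated assumptions, not the book's implementation: the class name ObservationRowKey, the method make, and the sample station ID are hypothetical, and the key is assembled with java.nio.ByteBuffer rather than any HBase helper class.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of HBase.
final class ObservationRowKey {

    // Layout: <station ID bytes><8-byte big-endian (Long.MAX_VALUE - timestamp)>.
    // Assumes a fixed-length ASCII station ID and a non-negative timestamp.
    static byte[] make(String stationId, long timestampMillis) {
        byte[] station = stationId.getBytes(StandardCharsets.US_ASCII);
        return ByteBuffer.allocate(station.length + Long.BYTES)
                .put(station)
                .putLong(Long.MAX_VALUE - timestampMillis) // reverse-order timestamp
                .array();
    }

    public static void main(String[] args) {
        byte[] older = make("011990-99999", 1_000L);
        byte[] newer = make("011990-99999", 2_000L);
        // Compare as unsigned bytes, left to right, which is the
        // lexicographical byte ordering described in the text.
        System.out.println(Arrays.compareUnsigned(newer, older) < 0);
    }
}
```

Because the station ID comes first, all keys for one station are adjacent, and because the reversed timestamp shrinks as time advances, the newer observation compares lower and therefore appears first in a scan.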

NOTE

We rely on the fact that station IDs are a fixed length. In some cases, you will need to
zero-pad number components so row keys sort properly. Otherwise, you will run into the issue
where 10 sorts before 2, say, when only the byte order is considered (with zero-padding,
02 sorts before 10).

Also, if your keys are integers, use a binary representation rather than persisting the
string version of a number. The former consumes less space.
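
The sorting and space points in this note can be checked with a short sketch. The class KeyEncodingDemo and its methods are hypothetical names introduced for illustration, the binary form assumes non-negative values stored as 8-byte big-endian longs, and the sample number is arbitrary.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class for illustration only; not part of HBase.
final class KeyEncodingDemo {

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Decimal string left-padded with zeros to a fixed width.
    static byte[] padded(long n, int width) {
        return ascii(String.format(Locale.ROOT, "%0" + width + "d", n));
    }

    // 8-byte big-endian form; byte order matches numeric order
    // only for non-negative values.
    static byte[] binary(long n) {
        return ByteBuffer.allocate(Long.BYTES).putLong(n).array();
    }

    // True if a sorts before b when compared as unsigned bytes.
    static boolean sortsBefore(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b) < 0;
    }

    public static void main(String[] args) {
        System.out.println(sortsBefore(ascii("10"), ascii("2")));       // unpadded strings
        System.out.println(sortsBefore(padded(2, 2), padded(10, 2)));   // zero-padded strings
        System.out.println(sortsBefore(binary(2), binary(10)));         // binary longs
        long big = 1_234_567_890_123L;
        System.out.println(ascii(Long.toString(big)).length + " bytes as a string, "
                + binary(big).length + " bytes as binary");
    }
}
```

Note that the space saving applies to larger values: a fixed 8-byte long is shorter than the 13-character string above, whereas a one- or two-digit number is shorter as a string than as an 8-byte long.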

In the shell, define the tables as follows:

hbase(main):001:0> create 'stations', {NAME => 'info'}
0 row(s) in 0.9600 seconds
hbase(main):002:0> create 'observations', {NAME => 'data'}
0 row(s) in 0.1770 seconds

WIDE TABLES

All access in HBase is via primary key, so the key design should lend itself to how the data
is going to be queried. One thing to keep in mind when designing schemas is that a defining
attribute of column(-family)-oriented stores, such as HBase, is the ability to host wide and
sparsely populated tables at no incurred cost. [139]

There is no native database join facility in HBase, but wide tables can make it so that there
is no need for database joins to pull from secondary or tertiary tables. A wide row can
sometimes be made to hold all data that pertains to a particular primary key.
