HBase - Hadoop: The Definitive Guide

Building an Online Query Application

Although HDFS and MapReduce are powerful tools for processing batch operations over

large datasets, they do not provide ways to read or write individual records efficiently. In

this example, we'll explore using HBase as the tool to fill this gap.

The existing weather dataset described in previous chapters contains observations for tens

of thousands of stations over 100 years, and this data is growing without bound. In this

example, we will build a simple online (as opposed to batch) interface that allows a user to

navigate the different stations and page through their historical temperature observations in

time order. We'll build simple command-line Java applications for this, but it's easy to see

how the same techniques could be used to build a web application to do the same thing.

For the sake of this example, let us allow that the dataset is massive, that the observations

run to the billions, and that the rate at which temperature updates arrive is significant —

say, hundreds to thousands of updates per second from around the world and across the

whole range of weather stations. Also, let us allow that it is a requirement that the online

application must display the most up-to-date observation within a second or so of receipt.

The first requirement, size, should preclude our use of a simple RDBMS instance and make

HBase a candidate store. The second requirement, latency, rules out plain HDFS. A

MapReduce job could build initial indices that allowed random access over all of the observation

data, but keeping those indices up to date as the updates arrive is not what HDFS and MapReduce are

good at.

Schema Design

In our example, there will be two tables:

stations

This table holds station data. Let the row key be the stationid. Let this table have a

column family info that acts as a key-value dictionary for station information. Let the

dictionary keys be the column names info:name, info:location, and

info:description. This table is static, and in this case, the info family closely

mirrors a typical RDBMS table design.

observations

This table holds temperature observations. Let the row key be a composite key of

stationid plus a reverse-order timestamp. Give this table a column family data that

will contain one column, airtemp, with the observed temperature as the column

value.
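
To make the composite row key concrete, here is a minimal sketch in plain Java. It uses no HBase client classes. It assumes that station identifiers are fixed-width ASCII strings and that the reverse-order timestamp is computed as Long.MAX_VALUE minus the observation time, which is one common way to make newer values sort first. The class name ObservationRowKey, the methods makeRowKey and timestampOf, and the sample values are hypothetical. A TreeMap ordered by unsigned byte comparison stands in for the sorted row keys of the observations table.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: stationid bytes followed by a reverse-order timestamp. */
final class ObservationRowKey {

    // Assumption: station ids are fixed-width ASCII, so the timestamp always
    // starts at the same offset within the key.
    static byte[] makeRowKey(String stationId, long timestampMillis) {
        byte[] station = stationId.getBytes(StandardCharsets.US_ASCII);
        return ByteBuffer.allocate(station.length + Long.BYTES)
                .put(station)
                // Newer timestamps give smaller big-endian values, so they sort first.
                .putLong(Long.MAX_VALUE - timestampMillis)
                .array();
    }

    // Recovers the original timestamp from the last eight bytes of the key.
    static long timestampOf(byte[] rowKey) {
        return Long.MAX_VALUE
                - ByteBuffer.wrap(rowKey, rowKey.length - Long.BYTES, Long.BYTES).getLong();
    }

    public static void main(String[] args) {
        // Byte-wise unsigned ordering, standing in for the table's sorted row keys.
        Comparator<byte[]> unsignedOrder = Arrays::compareUnsigned;
        TreeMap<byte[], Integer> rows = new TreeMap<>(unsignedOrder);

        String station = "011990-99999"; // sample station id, 12 characters
        rows.put(makeRowKey(station, 1000L), 11); // made-up airtemp values
        rows.put(makeRowKey(station, 3000L), 33);
        rows.put(makeRowKey(station, 2000L), 22);

        // Iterates newest first: timestamps 3000, 2000, 1000.
        for (Map.Entry<byte[], Integer> row : rows.entrySet()) {
            System.out.println(timestampOf(row.getKey()) + " -> " + row.getValue());
        }
    }
}
```

Because all keys for one station share the same prefix, they sit next to each other in sorted order, and within that run the most recent observation comes first. Under these assumptions, paging through a station's history in reverse time order amounts to reading forward from the station's prefix.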
