Building an Online Query Application
Although HDFS and MapReduce are powerful tools for processing batch operations over
large datasets, they do not provide ways to read or write individual records efficiently. In
this example, we'll explore using HBase as the tool to fill this gap.
The existing weather dataset described in previous chapters contains observations for tens
of thousands of stations over 100 years, and this data is growing without bound. In this
example, we will build a simple online (as opposed to batch) interface that allows a user to
navigate the different stations and page through their historical temperature observations in
time order. We'll build simple command-line Java applications for this, but it's easy to see
how the same techniques could be used to build a web application to do the same thing.
For the sake of this example, let us allow that the dataset is massive, that the observations
run to the billions, and that the rate at which temperature updates arrive is significant —
say, hundreds to thousands of updates per second from around the world and across the
whole range of weather stations. Also, let us allow that it is a requirement that the online
application must display the most up-to-date observation within a second or so of receipt.
The first requirement, size, should preclude our use of a simple RDBMS instance and make
HBase a candidate store. The second requirement, latency, rules out plain HDFS. A MapReduce
job could build initial indices that allowed random access over all of the observation
data, but keeping such an index up to date as the updates arrive is not what HDFS and
MapReduce are good at.
Schema Design
In our example, there will be two tables:
stations
This table holds station data. Let the row key be the stationid. Let this table have a
column family info that acts as a key-value dictionary for station information. Let the
dictionary keys be the column names info:name, info:location, and
info:description. This table is static, and in this case, the info family closely
mirrors a typical RDBMS table design.
observations
This table holds temperature observations. Let the row key be a composite key of
stationid plus a reverse-order timestamp. Give this table a column family data that
will contain one column, airtemp, with the observed temperature as the column
value.
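The composite row key for the observations table can be sketched in plain Java as
follows. This is a minimal illustration, not the application's actual code: the class
name ObservationRowKey and its methods are hypothetical, the reverse-order timestamp is
assumed to be Long.MAX_VALUE minus the observation time in milliseconds, and station IDs
are assumed to be fixed length and timestamps nonnegative. The compare method mimics
the unsigned byte-by-byte ordering that HBase applies to row keys, so we can check that
newer observations sort first.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class ObservationRowKey {
    private ObservationRowKey() {}

    // Row key layout: station ID bytes, then an 8-byte big-endian
    // reverse-order timestamp (Long.MAX_VALUE - timestamp).
    static byte[] make(String stationId, long timestampMillis) {
        byte[] id = stationId.getBytes(StandardCharsets.UTF_8);
        long reversed = Long.MAX_VALUE - timestampMillis;
        return ByteBuffer.allocate(id.length + Long.BYTES)
                .put(id)
                .putLong(reversed)
                .array();
    }

    // Unsigned lexicographic comparison of two keys, the same ordering
    // HBase uses to sort rows within a table.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

With this layout, all rows for one station are adjacent, and within a station the most
recent observation has the smallest key. A scan that starts at the station's first row
therefore returns observations newest first, which is what an interface that shows the
latest reading at the top needs.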