HFiles. The get command is instantaneously processed and the appropriate data
returned to the client.
Over time, as the smaller HFiles accumulate, the worker node runs a major
compaction that merges the smaller HFiles into one large HFile. During the
major compaction, the deleted entries and the tombstone markers are permanently
removed from the files.
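The effect of a major compaction can be illustrated with a minimal Python sketch. It is not HBase's implementation: the tuple layout, the use of None as a tombstone marker, and the function name major_compact are all assumptions made for illustration, and a tombstone is assumed to delete every version of its column at or before its own timestamp.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for an HFile: a list of
# (row_key, column, timestamp, value) entries. A value of None is
# assumed to mark a tombstone (a delete marker).
Entry = Tuple[str, str, int, Optional[str]]


def major_compact(hfiles: List[List[Entry]]) -> List[Entry]:
    """Merge several small files into one, dropping deleted data."""
    # Record the newest tombstone timestamp for each (row, column).
    tombstones: Dict[Tuple[str, str], int] = {}
    for hfile in hfiles:
        for row, col, ts, value in hfile:
            if value is None:
                key = (row, col)
                tombstones[key] = max(ts, tombstones.get(key, ts))

    merged: List[Entry] = []
    for hfile in hfiles:
        for row, col, ts, value in hfile:
            if value is None:
                continue  # the tombstone marker itself is not carried over
            if ts <= tombstones.get((row, col), -1):
                continue  # entry is covered by a tombstone, so it is dropped
            merged.append((row, col, ts, value))

    # Order by row key, then column, then newest version first.
    merged.sort(key=lambda e: (e[0], e[1], -e[2]))
    return merged


small_a: List[Entry] = [
    ("com.cnn.www", "anchor:a.com", 1, "cnn"),
    ("com.cnn.www", "anchor:b.com", 1, "old"),
]
small_b: List[Entry] = [
    ("com.cnn.www", "anchor:b.com", 2, None),  # delete marker for b.com
    ("com.cnn.www", "anchor:a.com", 3, "cnn.com"),
]
one_large_file = major_compact([small_a, small_b])
```

After the merge, one_large_file holds only the two anchor:a.com versions. The deleted anchor:b.com entry and its tombstone are both gone.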
Use Cases for HBase
As described in Google's Bigtable paper, a common use case for a data store such
as HBase is to store the results from a web crawler. Using this paper's example, the
column family, called anchor, is defined to capture the website URLs that provide
links to the row's website. What may not be an obvious implementation is that
those anchoring website URLs are used as the column qualifiers. For example, if
sportsillustrated.cnn.com links to www.cnn.com, the row key is com.cnn.www, the
column is anchor:sportsillustrated.cnn.com, and the value of
the cell is simply the text on the website that provides the link. Here is how the
CNN example may look in HBase following a get operation.
hbase> get 'web_table', 'com.cnn.www', {VERSIONS => 2}
COLUMN                              CELL
 anchor:sportsillustrated.cnn.com   timestamp=1380224620597, value=cnn
 anchor:sportsillustrated.cnn.com   timestamp=1380224000001, value=cnn.com
 anchor:edition.cnn.com             timestamp=1380224620597, value=cnn
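To make the layout of these cells concrete, the following Python sketch models the row as nested dictionaries: row key, then column (family:qualifier), then a list of timestamped versions. The structure and the get helper are illustrative assumptions, not the HBase client API. The data is copied from the output above.

```python
from typing import Dict, List, Tuple

# row key -> "family:qualifier" -> list of (timestamp, value) versions
Table = Dict[str, Dict[str, List[Tuple[int, str]]]]

web_table: Table = {
    "com.cnn.www": {
        "anchor:sportsillustrated.cnn.com": [
            (1380224000001, "cnn.com"),
            (1380224620597, "cnn"),
        ],
        "anchor:edition.cnn.com": [
            (1380224620597, "cnn"),
        ],
    }
}


def get(table: Table, row_key: str, versions: int = 1) -> List[Tuple[str, int, str]]:
    """Return up to `versions` newest cells for each column of a row.

    Hypothetical helper that mimics the shape of the shell output above.
    """
    result: List[Tuple[str, int, str]] = []
    for column, cells in table.get(row_key, {}).items():
        newest_first = sorted(cells, key=lambda cell: cell[0], reverse=True)
        for timestamp, value in newest_first[:versions]:
            result.append((column, timestamp, value))
    return result


for column, timestamp, value in get(web_table, "com.cnn.www", versions=2):
    print(f"{column} timestamp={timestamp}, value={value}")
```

Each linking site gets its own column under the anchor family, so a row can gain columns as the crawler finds more links, without any change to the table definition.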
Additional results are returned for each corresponding website that provides a
link to the CNN website. Note that the row key is com.cnn.www rather than
www.cnn.com. With the URL reversed, the suffixes (.com, .gov, or .net) that
correspond to the Internet's top-level domains are stored in order. Also, the
next part of the domain name (cnn) is stored in order.
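The row key com.cnn.www is the host name www.cnn.com with its dot-separated parts reversed. Because HBase keeps rows sorted by row key, reversing the parts groups hosts first by top-level domain and then by domain name. A small Python sketch shows the effect. The function name to_row_key and every host other than the CNN ones are assumptions chosen for illustration.

```python
from typing import List


def to_row_key(hostname: str) -> str:
    """Reverse the dot-separated parts of a host name.

    For example, www.cnn.com becomes com.cnn.www.
    """
    return ".".join(reversed(hostname.lower().split(".")))


hosts: List[str] = [
    "www.cnn.com",
    "www.nasa.gov",
    "edition.cnn.com",
    "www.bbc.com",
    "sportsillustrated.cnn.com",
]

# Sorting the keys as plain strings approximates the sorted order
# in which rows are stored.
row_keys = sorted(to_row_key(host) for host in hosts)
for key in row_keys:
    print(key)
```

In the sorted list, all .com hosts precede the .gov host, and the three cnn hosts sit next to one another, so a scan over the com.cnn. prefix would touch one contiguous range of rows.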