12.3.2 HBase and StumbleUpon
HBase plays a critical part in StumbleUpon's distributed platform. HBase is a distributed, column-oriented database that harnesses the power of the Hadoop and HDFS platform underneath it. But, as with any complex system, there are trade-offs: HBase shelves traditional relational database concepts, such as joins, foreign key relations, and triggers, in the pursuit of a system that hosts immensely large, sparsely populated data on commodity hardware in a scalable manner.
AN INTRODUCTION TO HBASE
HBase is modeled after Google's Bigtable,[3] a distributed storage system. Let's recap the basics of Bigtable and Bigtable-like systems:
■ Shares concepts of both column- and row-oriented databases. As described by the authors, Bigtable is “a sparse, distributed multidimensional sorted map.”
■ The basic unit of storage, a table, is split into multiple tablets (regions in HBase parlance).
■ Writes are buffered in memory, then flushed into read-only files after a while.
■ To keep the number of files low, they are merged in a compaction process that rewrites N files into 1.
■ Special tablets or regions are used to track the locations of the data.
■ Due to the column-oriented nature of the datastore, sparse tables (those with a majority of null cell values) are virtually free, as null values aren't stored explicitly.
■ Column families are used to group row columns. All columns in a family are stored together (for locality) and share storage and configuration parameters.
■ Table cells are stored with multiple versions instead of overwriting existing data.
■ Capacity (both storage size and processing speed) can be increased by simply adding machines to the cluster, with no special configuration needed.
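To make the write path and data model in the list above more concrete, here is a minimal in-memory sketch in Python. It is a toy model, not HBase's API: the class ToyTable, its method names, and the flush_threshold parameter are all hypothetical, and real systems flush by buffer size and persist files to HDFS rather than keeping tuples in a list.

```python
class ToyTable:
    """Toy model of a Bigtable-like table (hypothetical, not the HBase API)."""

    def __init__(self, flush_threshold=3):
        # Write buffer: (row, "family:qualifier", timestamp) -> value.
        self.memstore = {}
        # Immutable, key-sorted "store files" produced by flushes.
        self.files = []
        # Assumption: flush after this many buffered cells.
        self.flush_threshold = flush_threshold

    def put(self, row, column, value, timestamp):
        # A new version is stored next to older ones; nothing is overwritten.
        self.memstore[(row, column, timestamp)] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the in-memory buffer into a read-only, sorted file.
        if self.memstore:
            self.files.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def compact(self):
        # Rewrite N files into 1, keeping the cells in sorted key order.
        merged = {}
        for store_file in self.files:
            merged.update(store_file)
        self.files = [tuple(sorted(merged.items()))] if merged else []

    def get(self, row, column, max_versions=1):
        # Gather every version of the cell from the buffer and all files.
        cells = [(key[2], value) for key, value in self.memstore.items()
                 if key[:2] == (row, column)]
        for store_file in self.files:
            cells += [(key[2], value) for key, value in store_file
                      if key[:2] == (row, column)]
        # Newest timestamp first; a cell that was never written yields [].
        cells.sort(key=lambda cell: cell[0], reverse=True)
        return [value for _, value in cells[:max_versions]]
```

Only cells that were actually written occupy space, which is why a sparse table costs almost nothing, and a read may have to consult the buffer plus every file, which is why compaction keeps the file count low.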
HBase provides many additional features:
■ REST and Thrift[4] gateways allowing for easy access from non-Java development environments
■ Easy integration with Hadoop MapReduce for data processing
■ Harnesses the proven reliability and scalability of Hadoop and HDFS
■ Web-based UIs for management of both the master and region servers
■ Strong open source community
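As an illustration of the non-Java access mentioned above, the following Python sketch decodes a row in the JSON shape the HBase REST gateway is commonly documented to return, where row keys, column names, and values are base64-encoded. The response layout, the helper name decode_rest_row, and the sample payload are assumptions made for illustration; check them against the gateway version you run. No network call is made here.

```python
import base64
import json


def decode_rest_row(payload):
    """Decode a REST-gateway-style JSON document into
    {row_key: {"family:qualifier": (timestamp, value)}}.

    Assumes the layout {"Row": [{"key": ..., "Cell": [{"column": ...,
    "timestamp": ..., "$": ...}]}]} with base64-encoded strings.
    """
    def b64(text):
        return base64.b64decode(text).decode("utf-8")

    rows = {}
    for row in json.loads(payload)["Row"]:
        cells = rows.setdefault(b64(row["key"]), {})
        for cell in row["Cell"]:
            cells[b64(cell["column"])] = (cell["timestamp"], b64(cell["$"]))
    return rows


def _enc(text):
    # Helper used only to build the hypothetical sample below.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Hypothetical sample response for one row with a single cell.
SAMPLE = json.dumps({
    "Row": [{
        "key": _enc("com.example/"),
        "Cell": [{
            "column": _enc("content:html"),
            "timestamp": 1245393318192,
            "$": _enc("<html>v2</html>"),
        }],
    }]
})

decoded = decode_rest_row(SAMPLE)
```

Because the gateway speaks plain HTTP and JSON, a client like this needs nothing beyond a language's standard library, which is the point of offering REST and Thrift alongside the Java API.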
[3] Bigtable: A Distributed Storage System for Structured Data. Chang, et al. http://labs.google.com/papers/
[4] Thrift is a remote procedure call library originally developed at Facebook. It's now an Apache incubator project at http://incubator.apache.org/thrift/.