12.3.2 HBase and StumbleUpon
HBase plays a critical part in StumbleUpon's distributed platform. HBase is a distributed,
column-oriented database that harnesses the power of the Hadoop and HDFS
platform underneath it. But, as with any complex system, there are trade-offs: HBase
shelves traditional relational database concepts, such as joins, foreign key relations,
and triggers, in pursuit of a system that hosts immensely large, sparsely populated
data on commodity hardware in a scalable manner.
AN INTRODUCTION TO HBASE
HBase is modeled after Google's Bigtable,³ a distributed storage system. Let's recap the
basics of Bigtable and Bigtable-like systems:
• Shares concepts of both column- and row-oriented databases. As described by the authors, Bigtable is “a sparse, distributed multidimensional sorted map.”
• The basic unit of storage, a table, is split into multiple tablets (regions in HBase parlance).
• Writes are buffered in memory, then flushed into read-only files after a while.
• To keep the number of files low, they are merged in a compaction process that rewrites N files into 1.
• Special tablets or regions are used to track the locations of the data.
• Due to the column-oriented nature of the datastore, sparse tables (those with a majority of null cell values) are virtually free, as null values aren't stored explicitly.
• Column families are used to group row columns. All columns in a family are stored together (for locality) and share storage and configuration parameters.
• Table cells are stored with multiple versions instead of overwriting existing data.
• Capacity (both storage size and processing speed) can be increased by simply adding machines to the cluster, with no special configuration needed.
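Several of the points above (the sorted map, buffered writes, flushes to read-only files, compaction, and cell versions) can be illustrated with a toy, single-process model. This is only a sketch for building intuition: the class name ToyTable, its methods, and the flush_threshold parameter are hypothetical and are not part of the HBase or Bigtable APIs. Real systems also split tables into regions and persist files to a distributed filesystem, which is omitted here.

```python
class ToyTable:
    """Toy model of a Bigtable-like table. Illustrative only, not the HBase API."""

    def __init__(self, flush_threshold=3):
        # flush_threshold is a made-up knob: cells buffered before a flush.
        self.flush_threshold = flush_threshold
        self.memstore = {}     # in-memory write buffer
        self.store_files = []  # read-only, sorted snapshots (oldest first)

    def put(self, row, column, timestamp, value):
        # The map key is (row, "family:qualifier", timestamp).
        # Cells that were never written have no entry, so sparse rows cost nothing.
        self.memstore[(row, column, timestamp)] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the buffer into an immutable, key-sorted "file".
        if self.memstore:
            self.store_files.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def compact(self):
        # Rewrite N files into 1; for identical keys the later file wins.
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        self.store_files = [tuple(sorted(merged.items()))] if merged else []

    def get(self, row, column, max_versions=1):
        # Collect every version of the cell from the files and the buffer,
        # then return the newest ones as (timestamp, value) pairs.
        versions = {}
        for store_file in self.store_files:
            for (r, c, ts), value in store_file:
                if r == row and c == column:
                    versions[ts] = value
        for (r, c, ts), value in self.memstore.items():
            if r == row and c == column:
                versions[ts] = value
        return sorted(versions.items(), reverse=True)[:max_versions]


table = ToyTable(flush_threshold=2)
table.put("com.example/a", "stats:views", 1, "10")
table.put("com.example/a", "stats:views", 2, "11")  # second write triggers a flush
table.put("com.example/b", "meta:title", 1, "Page B")
table.flush()    # two read-only files now exist
table.compact()  # merged back into a single file
newest_two = table.get("com.example/a", "stats:views", max_versions=2)
```

Because older versions are kept rather than overwritten, the final lookup returns both writes to the same cell, newest first.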
HBase provides many additional features:
• REST and Thrift⁴ gateways allowing for easy access from non-Java development environments
• Easy integration with Hadoop MapReduce for data processing
• Harnesses the proven reliability and scalability of Hadoop and HDFS
• Web-based UIs for management of both the master and region servers
• Strong open source community
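To give a flavor of access from a non-Java environment, here is a minimal Python sketch of how a client might address a cell through the REST gateway and decode a JSON reply. It rests on assumptions that should be checked against the documentation for your HBase version: that cells are addressed as /table/row/column, and that JSON replies carry base64-encoded row keys, column names, and values under the "Row", "Cell", "key", "column", and "$" fields. The host, port, table, and helper names are hypothetical, and no network call is made.

```python
import base64
import json
from urllib.parse import quote


def rest_cell_url(base_url, table, row, column):
    # Assumed addressing scheme: <base>/<table>/<row>/<family:qualifier>
    parts = [
        base_url.rstrip("/"),
        quote(table, safe=""),
        quote(row, safe=""),
        quote(column, safe=":"),
    ]
    return "/".join(parts)


def _b64_text(encoded):
    # Keys, column names, and values are assumed to be base64-encoded UTF-8.
    return base64.b64decode(encoded).decode("utf-8")


def decode_rest_rows(payload):
    """Turn an assumed gateway JSON reply into {row_key: {column: value}}."""
    document = json.loads(payload)
    rows = {}
    for row in document.get("Row", []):
        cells = rows.setdefault(_b64_text(row["key"]), {})
        for cell in row.get("Cell", []):
            cells[_b64_text(cell["column"])] = _b64_text(cell["$"])
    return rows


# Hypothetical gateway address and table name.
url = rest_cell_url("http://localhost:8080", "urls", "com.example/a", "stats:views")
```

A client would send a GET to that URL with an `Accept: application/json` header and pass the response body to decode_rest_rows.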
³ Bigtable: A Distributed Storage System for Structured Data. Chang, et al. http://labs.google.com/papers/bigtable.html.
⁴ Thrift is a remote procedure call library originally developed at Facebook. It's now an Apache incubator project at http://incubator.apache.org/thrift/.