Case Studies - Hadoop in Action


12.3.2 HBase and StumbleUpon

HBase plays a critical part in StumbleUpon's distributed platform. HBase is a distributed, column-oriented database that harnesses the power of the Hadoop and HDFS platform underneath it. But, as with any complex system, there are trade-offs: HBase shelves traditional relational database concepts, such as joins, foreign key relations, and triggers, in the pursuit of a system that hosts immensely large, sparsely populated data on commodity hardware in a scalable manner.

AN INTRODUCTION TO HBASE

HBase is modeled after Google's Bigtable,[3] a distributed storage system. Let's recap the basics of Bigtable and Bigtable-like systems:

■ Shares concepts of both column- and row-oriented databases. As described by the authors, Bigtable is “a sparse, distributed, persistent multidimensional sorted map.”
■ The basic unit of storage, a table, is split into multiple tablets (regions in HBase parlance).
■ Writes are buffered in memory, then flushed into read-only files after a while.
■ To keep the number of files low, they are merged in a compaction process that rewrites N files into 1.
■ Special tablets or regions are used to track the locations of the data.
■ Due to the column-oriented nature of the datastore, sparse tables (those with a majority of null cell values) are virtually free, because null values aren't stored explicitly.
■ Column families are used to group row columns. All columns in a family are stored together (for locality) and share storage and configuration parameters.
■ Table cells are stored with multiple versions instead of overwriting existing data.
■ Capacity (both storage size and processing speed) can be increased by simply adding machines to the cluster, with no special configuration needed.
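Several of the points above (the sorted map, buffered writes, read-only files, compaction, unstored nulls, and cell versions) can be pictured with a small in-memory toy. The Python sketch below is not HBase or Bigtable code: the class `ToyRegion`, its methods, the `flush_threshold` parameter, and the "family:qualifier" column strings are all hypothetical names chosen for illustration. It also keeps every version of a cell forever, which is a simplification.

```python
class ToyRegion:
    """Toy model of one Bigtable-style region. Illustrative only, not HBase code."""

    def __init__(self, flush_threshold=3):
        # In-memory write buffer: (row, "family:qualifier", timestamp) -> value
        self.memstore = {}
        # Flushed files: each is an immutable, key-sorted tuple of (key, value) pairs
        self.store_files = []
        self.flush_threshold = flush_threshold  # assumed knob, in cells

    def put(self, row, column, value, timestamp):
        # Only cells that are actually written take up space; a "null" cell
        # is simply a key that never appears anywhere.
        self.memstore[(row, column, timestamp)] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as one sorted, read-only file and start a new buffer.
        if self.memstore:
            self.store_files.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def compact(self):
        # Rewrite N files into 1, keeping the merged result sorted by key.
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        self.store_files = [tuple(sorted(merged.items()))] if merged else []

    def get(self, row, column, max_versions=1):
        # A read consults the write buffer and every file, newest version first.
        cells = [(ts, value) for (r, c, ts), value in self.memstore.items()
                 if (r, c) == (row, column)]
        for store_file in self.store_files:
            cells += [(ts, value) for (r, c, ts), value in store_file
                      if (r, c) == (row, column)]
        cells.sort(key=lambda cell: cell[0], reverse=True)
        return [value for _, value in cells[:max_versions]]


region = ToyRegion(flush_threshold=2)
region.put("url1", "info:title", "Old title", timestamp=1)
region.put("url1", "info:title", "New title", timestamp=2)  # buffer full, flushed
region.put("url2", "stats:likes", "10", timestamp=3)
region.flush()    # second read-only file
region.compact()  # two files rewritten into one
print(region.get("url1", "info:title", max_versions=2))
```

Writing the same cell twice does not overwrite anything: both timestamps survive the flush and the compaction, and a read simply asks for as many of the newest versions as it wants. A lookup for a cell that was never written, such as `region.get("url2", "info:title")`, returns an empty list without any placeholder having been stored.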

HBase provides many additional features:

■ REST and Thrift[4] gateways allowing for easy access from non-Java development environments
■ Easy integration with Hadoop MapReduce for data processing
■ Harnesses the proven reliability and scalability of Hadoop and HDFS
■ Web-based UIs for management of both the master and region servers
■ Strong open source community

[3] Bigtable: A Distributed Storage System for Structured Data. Chang, et al. http://labs.google.com/papers/bigtable.html.

[4] Thrift is a remote procedure call library originally developed at Facebook. It's now an Apache incubator project at http://incubator.apache.org/thrift/.
