The Nonrelational Landscape - Cassandra: The Definitive Guide

Columnar Databases

A columnar database is a data store that organizes data around columns instead of

rows. This slight shift in focus optimizes the workload for certain kinds of problems, in particular

data warehouses and analytics applications that require computing aggregate values over very

large sets of similar data. Columnar (or “column-oriented”) databases are well-suited to online

analytical processing (OLAP) work, where queries are executed over a broad dataset.
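
As a rough illustration of why the layout matters for aggregates, the sketch below holds the same three records in a row-oriented and a column-oriented form. The field names and values are made up for this example, and plain Python lists stand in for on-disk storage; no real database works exactly like this.

```python
# Hypothetical sample data; the field names are illustrative only.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 80.0},
    {"order_id": 3, "region": "east", "amount": 200.0},
]

def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

# Row-oriented: summing one field still visits every whole record.
total_from_rows = sum(record["amount"] for record in rows)

# Column-oriented: the aggregate touches only the one column it needs.
columns = to_columns(rows)
total_from_columns = sum(columns["amount"])
```

Both totals are the same; the difference is how much unrelated data has to be read to compute them.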

Data storage works a little differently with columnar databases, in order to reduce disk usage

and the amount of time spent on I/O. For example, columnar databases allow you to write a record

containing a value for only one out of a large number of possible columns, and only that single

column value will be stored and take up space. This differs from an RDBMS, in which nulls

are not stored for free. It can be useful to think of RDBMS tables as spreadsheets, in which every

row has the same set of columns, and null values are maintained to keep the grid-like

shape of the data structure. This model doesn't work for columnar databases, though, because

null values are simply not stored. It's more helpful to think of columnar data as tags: values can be of

arbitrary length, and the names and widths of columns are not preset.
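
The spreadsheet-versus-tags contrast can be sketched in a few lines. The column list and the helper names `grid_row` and `sparse_row` are hypothetical, chosen only for this illustration; they are not part of any database API.

```python
# Hypothetical fixed schema for the spreadsheet-style view.
ALL_COLUMNS = ["name", "email", "phone", "fax", "twitter"]

def grid_row(values):
    """Spreadsheet-style row: one slot per column, None where missing."""
    return [values.get(column) for column in ALL_COLUMNS]

def sparse_row(values):
    """Tag-style row: keeps only the columns that were actually written."""
    return {column: value for column, value in values.items()
            if value is not None}

written = {"email": "jane@example.com"}

grid = grid_row(written)      # five slots, four of them None
sparse = sparse_row(written)  # a single entry

# In the tag-style form a column that was never declared is no problem.
sparse["nickname"] = "jj"
```

The grid form pays for every declared column on every row, while the sparse form pays only for what was written.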

Columnar databases often require the data in a column to be of a uniform type, which presents an opportunity

for data compression.
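
To make the compression opportunity concrete, here is a minimal sketch using run-length encoding. The text does not name a compression scheme, so run-length encoding is an assumption picked for illustration, and the function names and sample column are hypothetical.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse each run of equal adjacent values into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]

# A column holding values of one type, with many adjacent repeats.
country = ["US", "US", "US", "DE", "DE", "US"]

encoded = rle_encode(country)
decoded = rle_decode(encoded)
```

Here six stored values shrink to three pairs, and decoding restores the column exactly. The gain depends on how many adjacent repeats the column contains.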

Columnar databases have been around since the early 1970s. Sybase IQ, for example, is one of

these, and was for many years the only commercial columnar database.

Among the recent (mostly open source) projects that are part of the NoSQL conversation, there

are a few databases that evolved from basic key-value stores and feature a richer

data model. You can think of these columnar databases as multidimensional key-value stores or

distributed hash tables that, instead of supporting merely straight key-value pairs, allow for

arrangements called column families to help organize columns and provide a richer model. These

are Google's Bigtable, HBase, Hypertable, and Cassandra.
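
One way to picture a "multidimensional key-value store" is as nested maps: row key, then column family, then column name, then value. The class below is a toy in-memory sketch under that assumption. The name `ColumnFamilyStore` and its methods are hypothetical and do not reflect the API of Bigtable, HBase, Hypertable, or Cassandra, and it ignores distribution and persistence entirely.

```python
from collections import defaultdict

class ColumnFamilyStore:
    """Toy model: row key -> column family -> column name -> value."""

    def __init__(self, families):
        # Column families are declared up front; columns inside them are not.
        self._families = set(families)
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self._families:
            raise KeyError(f"unknown column family: {family}")
        self._rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        # .get() avoids creating empty entries for rows that do not exist.
        return self._rows.get(row_key, {}).get(family, {}).get(column, default)

    def get_family(self, row_key, family):
        """Return a copy of all columns stored in one family of one row."""
        return dict(self._rows.get(row_key, {}).get(family, {}))

store = ColumnFamilyStore(["profile", "activity"])
store.put("user:1", "profile", "email", "jane@example.com")
store.put("user:1", "activity", "last_login", "2010-05-01")
store.put("user:2", "profile", "nickname", "jj")  # different columns per row

email = store.get("user:1", "profile", "email")
missing = store.get("user:2", "profile", "email")
profile_2 = store.get_family("user:2", "profile")
```

Compared with a flat key-value store, a lookup takes three keys instead of one, and each row may hold a different set of columns within the same family.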

Google's Bigtable is really the parent of the modern columnar databases. It is proprietary, but

there are a few published papers on its design, and each of the other columnar databases

discussed is an implementation that closely follows Bigtable's design or, as in the case of Cassandra,

takes certain key ideas from Bigtable.

Google Bigtable

Bigtable is Google's internally used custom database, designed to scale into the petabyte range.

Bigtable is described in the paper published by Google in 2006 called “Bigtable: A Distributed

Storage System for Structured Data.” The goals of the project are stated in that paper: “wide

applicability, scalability, high performance, and high availability.” Bigtable is used extensively
