Columnar Databases
A columnar database is a data store that organizes data around columns instead of rows. This
shift in focus optimizes the workload for certain kinds of problems, in particular data
warehouses and analytics applications that compute aggregate values over very large sets of
similar data. Columnar (or “column-oriented”) databases are well suited to online analytical
processing (OLAP) work, where queries are executed over a broad dataset.
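As a rough illustration of why a column layout suits aggregates, the following Python sketch pivots a handful of records from a row layout into a column layout and sums one field both ways. The records, field names, and the to_columns helper are hypothetical and exist only for this example; a real columnar database does this on disk, not in Python lists.

```python
# Hypothetical sales records, used only for illustration.
rows = [
    {"region": "east", "units": 3, "price": 9.5},
    {"region": "west", "units": 5, "price": 7.0},
    {"region": "east", "units": 2, "price": 4.0},
]

def to_columns(records):
    """Pivot a list of row dicts into one list per column name."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Row layout: every whole record is visited just to read one field.
total_units_by_row = sum(record["units"] for record in rows)

# Column layout: only the "units" column is touched.
total_units_by_column = sum(columns["units"])
```

Both totals are the same; the difference is how much unrelated data has to be read to compute them.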
Data storage works a little differently in columnar databases, in order to reduce disk usage
and the amount of time spent on I/O. For example, columnar databases allow you to write a record
containing a value for only one out of a large number of possible columns, and only that single
column value will be stored and take up space. This differs from an RDBMS, in which nulls
are not stored for free. It can be useful to think of RDBMS tables as spreadsheets, in which
every row has a slot for every column, and null values are maintained to keep the grid-like
shape of the data structure. This model doesn't work for columnar databases, though, because
null values are simply not stored. It's more helpful to think of columnar data as tags: values
can be of arbitrary length, and the names and widths of columns are not preset.
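The contrast between the spreadsheet model and the tag model can be sketched in a few lines of Python. The column names and the grid_row and sparse_row helpers are hypothetical, and the sketch only mimics the storage idea in memory; it does not reflect the on-disk format of any particular database.

```python
# Hypothetical column names; a real table may define far more.
ALL_COLUMNS = ["name", "email", "phone", "fax", "twitter"]

def grid_row(values):
    """Spreadsheet-style row: one slot per column, None where no value exists."""
    return [values.get(column) for column in ALL_COLUMNS]

def sparse_row(values):
    """Tag-style row: keep only the columns that were actually written."""
    return dict(values)

# A record that supplies a value for a single column.
written = {"email": "alice@example.com"}

grid = grid_row(written)      # five slots, four of them placeholders
sparse = sparse_row(written)  # one stored value, no placeholders
```

In the grid form the four missing values still occupy slots; in the sparse form they are absent altogether.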
Columnar databases often require the values in a column to be of a uniform type, which presents
an opportunity for data compression, since similar values stored together tend to compress well.
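One simple way to see the compression opportunity is run-length encoding, shown below as a minimal Python sketch. Run-length encoding is chosen here only as an illustration; the text does not say which compression schemes any given columnar database uses, and the sample column is made up.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]

# A hypothetical column of uniform type with many repeated values.
country = ["US", "US", "US", "DE", "DE", "US"]

encoded = rle_encode(country)
decoded = rle_decode(encoded)
```

Six stored values become three pairs, and decoding restores the column exactly. The same trick would do little for a row layout, where adjacent values belong to different columns and rarely repeat.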
Columnar databases have been around since the early 1970s. Sybase IQ, for example, is one of
these, and was for many years the only commercial columnar database.
But among the recent (mostly open source) projects that are part of the NoSQL conversation,
a few databases have evolved from basic key-value stores by offering a richer data model.
You can think of these columnar databases as multidimensional key-value stores or distributed
hash tables that, instead of supporting only plain key-value pairs, allow for arrangements
called column families that help organize columns and provide a richer model. These
are Google's Bigtable, HBase, Hypertable, and Cassandra.
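The "multidimensional key-value store" idea can be pictured as nested maps: a row key leads to a column family, which leads to a column name, which leads to a value. The Python class below is a toy in-memory sketch of that shape. The class name, its methods, and the choice to declare families up front are assumptions made for the example, and it leaves out everything that makes the real systems interesting (distribution, persistence, versioning).

```python
from collections import defaultdict

class ColumnFamilyStore:
    """Toy model: row key -> column family -> column name -> value."""

    def __init__(self, families):
        # Assumption for this sketch: families are fixed when the store is
        # created, while column names inside a family are free-form.
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        # Use .get at each level so that reads never create empty entries.
        return self.rows.get(row_key, {}).get(family, {}).get(column, default)

# Hypothetical usage: two families, with columns added on the fly.
store = ColumnFamilyStore(["profile", "activity"])
store.put("user42", "profile", "email", "alice@example.com")
store.put("user42", "activity", "last_login", "2010-01-01")

email = store.get("user42", "profile", "email")
missing = store.get("user99", "profile", "email")
```

A plain key-value store would stop at the first level (row key to value); the extra levels are what give the richer model.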
Google's Bigtable is really the parent of the modern columnar databases. It is proprietary, but
there are a few published papers on its design, and each of the other columnar databases
discussed is an implementation that closely follows Bigtable's design or, as in the case of
Cassandra, takes certain key ideas from Bigtable.
Google Bigtable
Bigtable is Google's internally used custom database, designed to scale into the petabyte range.
Bigtable is described in a paper Google published in 2006, “Bigtable: A Distributed
Storage System for Structured Data.” The goals of the project are stated in that paper: “wide
applicability, scalability, high performance, and high availability.” Bigtable is used extensively