NoSQL data architecture patterns - Making Sense of NoSQL

The key in the figure is typical of column stores. Unlike the typical spreadsheet, which

might have 100 rows and 100 columns, column family stores are designed to

be...well...very big. How big? Systems with billions of rows and hundreds or thousands

of columns are not unheard of. For example, a Geographic Information System (GIS)

like Google Earth might have a row ID for the longitude portion of a map and use the

column name for the latitude of the map. If you have one map for each square mile

on Earth, you could have 15,000 distinct row IDs and 15,000 distinct column IDs.

What's unusual about these large implementations is that if you viewed them in a

spreadsheet, you'd see that few cells contain data. This sparse matrix implementation is

a grid of values where only a small percentage of cells contain values. Unfortunately,

relational databases aren't efficient at storing sparse data, but column stores are designed

exactly for this purpose.
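To make the sparse matrix idea concrete, here is a minimal Python sketch, not taken from any particular product. It assumes a hypothetical in-memory `SparseGrid` class that keeps only the cells that hold a value, keyed by row ID and column name; the identifiers such as `lon-0042` and `lat-0107` are invented for illustration.

```python
from typing import Dict, Optional, Tuple


class SparseGrid:
    """Hypothetical sparse grid: only populated cells take up space."""

    def __init__(self) -> None:
        # The lookup key is the pair (row ID, column name).
        self._cells: Dict[Tuple[str, str], str] = {}

    def put(self, row_id: str, column_name: str, value: str) -> None:
        self._cells[(row_id, column_name)] = value

    def get(self, row_id: str, column_name: str) -> Optional[str]:
        # An empty cell is simply absent; nothing is stored for it.
        return self._cells.get((row_id, column_name))

    def stored_cells(self) -> int:
        return len(self._cells)


grid = SparseGrid()
grid.put("lon-0042", "lat-0107", "map-tile-a")
grid.put("lon-9000", "lat-0003", "map-tile-b")

# A dense 15,000 x 15,000 grid would reserve a slot for every cell.
total_possible = 15_000 * 15_000
fill_ratio = grid.stored_cells() / total_possible
```

Storage in this sketch grows with the number of populated cells rather than with the size of the full grid, which is the property the paragraph above describes.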

With a traditional relational database, you can use a simple SQL query to find all

the columns in any table; when querying sparse matrix systems, you must look for

every element in the database to get a full listing of all column names. One problem

that may occur with many columns is that running reports that list columns and

related columns can be tricky unless you use a column family (a high-level category of

data also known as an upper-level ontology). For example, you may have groups of

columns that describe a website, a person, a geographical location, and products for sale.

In order to view these columns together, you'd group them in the same column family

to make retrieval easier.
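The following Python sketch illustrates both points under stated assumptions: the cells live in a plain dictionary keyed by a hypothetical (row ID, column family, column name) tuple, and the sample data is invented. Listing every column name requires visiting every stored cell, while grouping by family makes related columns easy to retrieve together.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed key layout for illustration: (row ID, column family, column name).
CellKey = Tuple[str, str, str]

cells: Dict[CellKey, str] = {
    ("row-1", "person", "name"): "Ann",
    ("row-1", "location", "city"): "Oslo",
    ("row-2", "product", "price"): "19.99",
    ("row-2", "person", "email"): "bob@example.com",
}


def columns_by_family(store: Dict[CellKey, str]) -> Dict[str, List[str]]:
    """Scan every stored cell to discover which columns exist per family."""
    families = defaultdict(set)
    for _row_id, family, column in store:
        families[family].add(column)
    return {family: sorted(columns) for family, columns in families.items()}


def family_for_row(store: Dict[CellKey, str], row_id: str, family: str) -> Dict[str, str]:
    """Return all columns of one family for one row."""
    return {
        column: value
        for (r, f, column), value in store.items()
        if r == row_id and f == family
    }


catalog = columns_by_family(cells)
person_row_1 = family_for_row(cells, "row-1", "person")
```

A real column family store would index these lookups rather than scan a dictionary; the scan here only mirrors the cost described in the text.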

Not all column family stores use a column family as part of their key. If they do,

you'll need to take this into account when storing an item key, since the column

family is part of the key, and retrieval of data can't occur without it. Because the API

is simple, NoSQL products can scale to manage large volumes of data, adding new

rows and columns without needing to modify a data definition language.
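As a rough illustration of a key that includes the column family, the sketch below assumes a hypothetical `FamilyKeyedStore` with a simple put/get interface; neither the class nor the `CellKey` fields come from a specific product. It shows that a lookup with the wrong family finds nothing and that a new column can be written without any schema change.

```python
from typing import Dict, NamedTuple, Optional


class CellKey(NamedTuple):
    """Assumed key layout: the column family is part of the key."""
    row_id: str
    column_family: str
    column_name: str


class FamilyKeyedStore:
    """Hypothetical store exposing only put and get."""

    def __init__(self) -> None:
        self._data: Dict[CellKey, str] = {}

    def put(self, key: CellKey, value: str) -> None:
        # No table definition is consulted; any new column name is accepted.
        self._data[key] = value

    def get(self, key: CellKey) -> Optional[str]:
        return self._data.get(key)


store = FamilyKeyedStore()
store.put(CellKey("user-17", "person", "name"), "Ann")
# Adding a column that has never been seen before needs no schema change.
store.put(CellKey("user-17", "person", "nickname"), "Annie")

found = store.get(CellKey("user-17", "person", "name"))
# Same row and column name, wrong family: the lookup misses.
missing = store.get(CellKey("user-17", "website", "name"))
```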

4.3.3 Benefits of column family systems

The column family approach of using a row ID and column name as a lookup key is a

flexible way to store data, gives you benefits of higher scalability and availability, and

saves you time and hassles when adding new data to your system. As you read through

these benefits, think about the data your organization collects to see if a column

family store would help you gain a competitive advantage in your market.

Since column family systems don't rely on joins, they tend to scale well on

distributed systems. Although you can start your development on a single laptop, in

production column family systems are usually configured to store data on three distinct nodes

in possibly different geographic regions (geographically distinct data centers) to

ensure high availability. Column family systems have automatic failover built in to

detect failing nodes and algorithms to identify corrupt data. They use hashing and

indexing tools such as Bloom filters, which test probabilistically whether an item may be

present in a large dataset. The larger the dataset, the more these tools pay off. Finally, column

family implementations are designed to work with distributed filesystems (such as the
