The key in the figure is typical of column stores. Unlike the typical spreadsheet, which
might have 100 rows and 100 columns, column family stores are designed to
be...well...very big. How big? Systems with billions of rows and hundreds or thousands
of columns are not unheard of. For example, a Geographic Information System (GIS)
like Google Earth might have a row ID for the longitude portion of a map and use the
column name for the latitude of the map. If you have one map for each square mile
on Earth, you could have 15,000 distinct row IDs and 15,000 distinct column IDs.
What's unusual about these large implementations is that if you viewed them in a
spreadsheet, you'd see that few cells contain data. Such a sparse matrix is
a grid of values where only a small percentage of cells contain values. Unfortunately,
relational databases aren't efficient at storing sparse data, but column stores are designed
exactly for this purpose.
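To make the sparse-matrix idea concrete, here is a minimal sketch in Python. It assumes a plain in-memory dictionary keyed by (row ID, column name); the class name `SparseGrid`, its methods, and the sample keys are hypothetical and are not taken from any particular column store.

```python
# Sparse grid sketch: only cells that actually hold a value are stored.
# SparseGrid, put, get, and cell_count are hypothetical names for illustration.

class SparseGrid:
    def __init__(self):
        # Maps (row_id, column_name) -> value; empty cells take no space.
        self._cells = {}

    def put(self, row_id, column_name, value):
        self._cells[(row_id, column_name)] = value

    def get(self, row_id, column_name, default=None):
        return self._cells.get((row_id, column_name), default)

    def cell_count(self):
        return len(self._cells)


grid = SparseGrid()
grid.put("lon-0042", "lat-0107", "map tile A")
grid.put("lon-9731", "lat-14200", "map tile B")

# Grid size from the text's example: 15,000 row IDs by 15,000 column IDs.
total_possible = 15_000 * 15_000
fill_ratio = grid.cell_count() / total_possible
```

Only two entries are held in memory even though the logical grid has 225 million cells, which is the property the text attributes to column stores.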
With a traditional relational database, you can use a simple SQL query to find all
the columns in any table; when querying sparse matrix systems, you must scan
every element in the database to get a full listing of all column names. One problem
that may occur with many columns is that running reports that list columns and
related columns can be tricky unless you use a column family (a high-level category of
data, also known as an upper-level ontology). For example, you may have groups of
columns that describe a website, a person, a geographical location, and products for sale.
In order to view these columns together, you'd group them in the same column family
to make retrieval easier.
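The following Python sketch illustrates both points: listing column names requires a scan over every stored cell, and grouping by column family keeps related columns together. The key layout (row ID, column family, column name), the function `columns_by_family`, and the sample data are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical cells keyed by (row_id, column_family, column_name).
cells = {
    ("row1", "person", "first_name"): "Ann",
    ("row1", "location", "city"): "Oslo",
    ("row2", "person", "last_name"): "Lee",
    ("row2", "product", "sku"): "A-17",
}


def columns_by_family(cells):
    """Scan every cell and collect the column names seen in each family."""
    families = defaultdict(set)
    for (_row_id, family, column) in cells:
        families[family].add(column)
    # Sort for a stable, report-friendly listing.
    return {family: sorted(columns) for family, columns in families.items()}


listing = columns_by_family(cells)
person_columns = listing["person"]
```

There is no central catalog of columns in this sketch, so the full scan is the only way to discover them; the family name is what lets a report pull the "person" columns without also returning the "product" ones.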
Not all column family stores use a column family as part of their key. If they do,
you'll need to take this into account when storing an item key, since the column
family is part of the key, and retrieval of data can't occur without it. Because the API
is simple, NoSQL products can scale to manage large volumes of data, and you can add new
rows and columns without changing a schema through data definition language (DDL) statements.
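As a rough sketch of a key that includes the column family, the Python below uses a named tuple as the lookup key. The names `CellKey`, `put`, and `get`, and the in-memory dictionary, are hypothetical; real products expose their own APIs.

```python
from typing import NamedTuple


class CellKey(NamedTuple):
    """Hypothetical composite key: the column family is part of the key."""
    row_id: str
    column_family: str
    column_name: str


store = {}  # CellKey -> value


def put(row_id, column_family, column_name, value):
    store[CellKey(row_id, column_family, column_name)] = value


def get(row_id, column_family, column_name):
    # Returns None when no cell exists under exactly this key.
    return store.get(CellKey(row_id, column_family, column_name))


put("user42", "person", "email", "ann@example.com")
# A new column is added simply by writing to it; no schema change is needed.
put("user42", "person", "nickname", "ann")

found = get("user42", "person", "email")
missing = get("user42", "website", "email")  # wrong family, so no match
```

Because the family is part of the key, a lookup under the wrong family finds nothing even though the row ID and column name match.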
4.3.3 Benefits of column family systems
The column family approach of using a row ID and column name as a lookup key is a
flexible way to store data, gives you the benefits of higher scalability and availability, and
saves you time and hassle when adding new data to your system. As you read through
these benefits, think about the data your organization collects to see if a column
family store would help you gain a competitive advantage in your market.
Since column family systems don't rely on joins, they tend to scale well on distributed
systems. Although you can start your development on a single laptop, in production,
column family systems are usually configured to store data on three distinct nodes,
possibly in different geographic regions (geographically distinct data centers), to
ensure high availability. Column family systems have automatic failover built in to
detect failing nodes, as well as algorithms to identify corrupt data. They leverage advanced
hashing and indexing tools such as Bloom filters to run probabilistic membership checks on
large datasets. The larger the dataset, the better these tools perform. Finally, column
family implementations are designed to work with distributed filesystems (such as the