Databases Reference
In-Depth Information
The following concepts are critical to understanding how column databases work:
Column family
Super column
Column
In a relational database you must define a schema for each table; in a column
family, however, the only things you define are the name and the key sort options
(there is no schema).
Column families. A column family is how the data is stored on the
disk. All the data in a single column family will sit in the same file
(actually, a set of files, but that is close enough). A column family
can contain super columns or columns.
Super columns. A super column is a dictionary; it is a column that
contains other columns (but not other super columns).
Columns. A column is a tuple of name, value, and timestamp.
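The three concepts above can be pictured as nested data structures. The following is a minimal in-memory sketch in Python, not the storage format of any particular product; the class names, the `put` and `insert` helpers, and the sample keys (`people`, `person:1`) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Column:
    """A column is a tuple of name, value, and timestamp."""
    name: str
    value: bytes
    timestamp: int


@dataclass
class SuperColumn:
    """A dictionary of columns; it may not contain other super columns."""
    name: str
    columns: Dict[str, Column] = field(default_factory=dict)

    def put(self, name: str, value: bytes, timestamp: int) -> None:
        # Hypothetical helper: store (or overwrite) one column by name.
        self.columns[name] = Column(name, value, timestamp)


@dataclass
class ColumnFamily:
    """Only a name is declared up front; rows carry no fixed schema."""
    name: str
    # row key -> {column name -> Column or SuperColumn}
    rows: Dict[str, Dict[str, Union[Column, SuperColumn]]] = field(
        default_factory=dict
    )

    def insert(self, row_key: str, item: Union[Column, SuperColumn]) -> None:
        # Hypothetical helper: attach a column or super column to a row.
        self.rows.setdefault(row_key, {})[item.name] = item


# Illustrative usage with made-up data.
people = ColumnFamily("people")
people.insert("person:1", Column("gender", b"F", timestamp=1))

address = SuperColumn("address")
address.put("city", b"Oslo", timestamp=2)
people.insert("person:1", address)
```

Note that nothing stops two rows in the same column family from holding entirely different sets of columns, which is what "there is no schema" means in practice.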
Schema design in a column family database (CFDB) is of great importance; if you
don't build your schema right, you literally can't get the data out. A CFDB usually
offers one of two forms of query: by key or by key range. A CFDB is meant to be
distributed, and the key determines where the physical data is located. Data is
stored based on the sort order of the column family, and you have no real way of
changing the sorting (except choosing between ascending and descending). The sort
order, unlike in a relational database, isn't affected by the column values but by
the column names.
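To make the two query forms concrete, here is a small Python sketch of a store that keeps its entries sorted by name and answers only point lookups and range lookups. It is an illustration under simplifying assumptions (a single in-memory node, string names, ascending order); the `SortedRow` class and its method names are hypothetical and do not correspond to any real CFDB API.

```python
from bisect import bisect_left, bisect_right, insort


class SortedRow:
    """Entries kept in ascending order of their names, never their values."""

    def __init__(self):
        self._names = []    # sorted list of names
        self._values = {}   # name -> value

    def put(self, name, value):
        # Insert the name at its sorted position the first time it is seen.
        if name not in self._values:
            insort(self._names, name)
        self._values[name] = value

    def get(self, name):
        # Query form 1: lookup by exact key.
        return self._values[name]

    def get_range(self, start, end):
        # Query form 2: lookup by key range (inclusive on both ends).
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# Illustrative usage: names are dates, so a range query walks them in date order.
row = SortedRow()
row.put("2009-03", "zebra")
row.put("2009-01", "mango")
row.put("2009-02", "apple")

first_quarter = row.get_range("2009-01", "2009-02")
```

In this sketch `first_quarter` comes back ordered by name (`2009-01` before `2009-02`) even though the values would sort the other way. This is why the names must be designed around the queries you intend to run: there is no way to ask for the entries ordered or filtered by value.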
In order to clarify the concepts of column families and the type of problems they
help solve, let's look at an example.
Imagine you have a database that contains census data. The person table
(Figure 6-10) has one row for each person who participated in the census and would
probably be keyed by a unique key. All singleton attributes, such as date of birth,
gender, address, and so forth, would exist in this table. Some repeating attributes,
like work history, would be normalized out into related tables. Depending upon the
size of the sample, a census may take in hundreds of millions of people, and the
table would look something like Figure 6-10.
 