Databases Reference
In-Depth Information
In this example, a column family in our person table is a list of cars the person has
owned, and the year of manufacture for each car. This would mean a sub-table in a
relational data model; but a CFDB column family can accommodate this because it can
contain many name/value pairs, where the name is the column name and the value is the
value of that column for that row. It is important to realize that the names of the columns
in a single family can vary arbitrarily for each row.
The column families thus can be divided into static and dynamic families. Static
families like personal data and demographic data in our examples above have mostly
the same column names on every row. Dynamic families like cars owned contain mostly
different column names for each row.
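As a rough illustration of the static/dynamic distinction, the sketch below models rows as nested Python dictionaries. The row keys, family names, column names, and values (`person:1`, `cars_owned`, and so on) are invented for this example and do not correspond to any particular CFDB product's API. The check for "static" is also stricter than the text: it requires identical column names on every row, not mostly the same ones.

```python
# Hypothetical in-memory model: row key -> column family -> {column name: value}.
person_table = {
    "person:1": {
        # Static family: the same column names appear on every row.
        "personal_data": {"first_name": "Ann", "last_name": "Lee"},
        # Dynamic family: column name = car model, value = year of manufacture.
        "cars_owned": {"Honda Civic": 2009, "Ford Focus": 2014},
    },
    "person:2": {
        "personal_data": {"first_name": "Raj", "last_name": "Patel"},
        "cars_owned": {"Toyota Prius": 2012},
    },
}


def column_names(table, family):
    """Return, for each row key, the set of column names used in one family."""
    return {key: set(row.get(family, {})) for key, row in table.items()}


def is_static(table, family):
    """Simplification: a family counts as static if all rows share the same names."""
    name_sets = list(column_names(table, family).values())
    return all(names == name_sets[0] for names in name_sets)
```

Here `is_static(person_table, "personal_data")` is true, while `cars_owned` fails the check because each row carries its own set of column names.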
A CFDB design seems to lack many of the features that are fundamental to
data access: no joins, no real querying capability (except by primary key), nothing like the
richness that we get from a relational database. Why is it so limited?
Note
A CFDB is designed to run on a large number of machines and to store a huge
amount of information. That amount of data does not fit in a single-machine relational
database, and even multi-machine relational databases, such as Oracle RAC, will struggle
to handle the data sizes and query volumes that are typical for a CFDB.
The reason that a CFDB design doesn't provide joins is that joins require you to
be able to scan the entire data set. That requires either someplace that has a view of
the whole database (resulting in a bottleneck and a single point of failure) or actually
executing a query over all machines in the cluster. Since the number of machines can be
very high, such queries are best avoided.
CFDB designs don't provide a way to query by column or value because that would
necessitate either an index of the entire data set (or at least of a single column family),
which again is not practical, or running the query on all machines, which is not feasible
at this scale. By limiting queries to just those done by key, a CFDB design ensures that it
knows exactly which node a query can run on. Each query therefore runs on a small set of
data, which makes it much cheaper and faster.
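The routing argument above can be sketched in a few lines. This is a minimal illustration, assuming rows are assigned to nodes by hashing the row key; the node names and function names are hypothetical, and real systems typically use more elaborate partitioning schemes than a simple modulo.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster


def node_for_key(row_key, nodes=NODES):
    """Map a row key to exactly one node using a stable hash of the key."""
    digest = hashlib.sha256(row_key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def nodes_for_key_query(row_key, nodes=NODES):
    """A lookup by key only needs to contact the single node that owns the key."""
    return [node_for_key(row_key, nodes)]


def nodes_for_value_query(nodes=NODES):
    """Without a global index, rows matching a value could live on any node."""
    return list(nodes)
```

A key lookup contacts one node regardless of cluster size, whereas the value query fans out to every node, so its cost grows with the cluster.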
Model Column Families Around Query Patterns
As discussed earlier, NoSQL data modeling is always based on query patterns; however,
it is also important to understand the business context behind the objects of interest.
Hence, start your design with entities and relationships, if you can. Unlike in relational
databases, it's not easy to tune or introduce new query patterns by simply creating
secondary indexes or writing complex SQL queries (using joins, order by, group by), because
of the high-scale distributed nature of a CFDB. So think about query patterns up front, and
design column families accordingly.
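One common way to design around query patterns is to keep one column family per query and write each fact to all of them. The sketch below assumes two hypothetical query patterns on the person/cars example ("which cars has this person owned?" and "who has owned this car model?"); the family names, keys, and data are invented, and duplicating data this way is an illustrative assumption rather than a rule stated in the text.

```python
from collections import defaultdict

# Family keyed by person: answers "which cars has this person owned?"
cars_by_person = defaultdict(dict)
# Family keyed by car model: answers "who has owned this car model?"
owners_by_car = defaultdict(dict)


def record_ownership(person_id, car_model, year):
    """Write the same fact to both families so each query is one key lookup."""
    cars_by_person[person_id][car_model] = year
    owners_by_car[car_model][person_id] = year


record_ownership("person:1", "Honda Civic", 2009)
record_ownership("person:2", "Honda Civic", 2011)
record_ownership("person:1", "Ford Focus", 2014)

# Each query pattern is served by a lookup on a single row key.
cars_of_person_1 = cars_by_person["person:1"]
civic_owners = sorted(owners_by_car["Honda Civic"])
```

The trade-off is extra storage and more writes per fact in exchange for never needing a join or a query by value at read time.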
Entities and their relationships still matter (unless the use case is special, perhaps
storing logs or other time series data). What if you are given query patterns to create a
data model for an e-commerce website, but you were not told anything about the entities
and relationships? You might try to figure out entities and relationships, knowingly or
 
 