Data Modeling Approaches for Big Data and Analytics Solutions - Big Data Imperatives

In this example, a column family in our person table is a list of cars the person has

owned, and the year of manufacture for each car. This would mean a sub-table in a

relational data model; but a CFDB column family can accommodate this because it can

contain many name/value pairs, where the name is the column name and the value is the

value of that column for that row. It is important to realize that the names of the columns

in a single family can vary arbitrarily for each row.

The column families thus can be divided into static and dynamic families. Static

families like personal data and demographic data in our examples above have mostly

the same column names on every row. Dynamic families like cars owned contain mostly

different column names for each row.
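
As a rough illustration, the person row described above can be sketched with plain Python dictionaries. The row keys, family names, sample values, and helper functions here are hypothetical, and the strict "same names on every row" test is a simplification of "mostly the same"; a real CFDB stores this data across many nodes rather than in memory.

```python
# Hypothetical in-memory sketch: row key -> column family -> {column name: value}.
person_table = {
    "person:1001": {
        # Static family: the same column names appear on every row.
        "personal_data": {"first_name": "Ann", "last_name": "Lee"},
        # Dynamic family: the column name is the car, the value is its year.
        "cars_owned": {"Honda Civic": 2004, "Ford Focus": 2011},
    },
    "person:1002": {
        "personal_data": {"first_name": "Raj", "last_name": "Patel"},
        "cars_owned": {"Toyota Prius": 2015},
    },
}


def column_names(table, row_key, family):
    """Return the sorted column names stored in one family of one row."""
    return sorted(table[row_key][family])


def is_static(table, family):
    """Treat a family as static if every row uses the same column names."""
    name_sets = {frozenset(row.get(family, {})) for row in table.values()}
    return len(name_sets) <= 1
```

Note that the two rows share no column names in the cars owned family, which a fixed relational schema could only express with a separate table.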

Note

■ A CFDB design seems to offer few of the features that are fundamental to

data access: no joins, no real querying capability (except by primary key), nothing like the

richness that we get from a relational database. Why is it so limited?

A CFDB is designed to run on a large number of machines and to store a huge

amount of information. It is not practical to store that amount of data in a single relational

database, and even multi-machine relational databases, such as Oracle RAC, will struggle

to handle the size of data and queries that are typical for CFDB.

The reason that a CFDB design doesn't provide joins is that joins require you to

be able to scan the entire data set. That requires either someplace that has a view of

the whole database (resulting in a bottleneck and a single point of failure) or actually

executing a query over all machines in the cluster. Since that number can be pretty high,

you would want to avoid such situations.

CFDB designs don't provide a way to query by column or value because that would

necessitate either an index of the entire data set (or at least of a single column family), which

again is not practical, or running the query on all machines, which is not feasible at that scale. By

limiting queries to just those done by key, a CFDB design ensures that it knows exactly

what node a query can run on. It means that each query runs on a small set of data,

making it much cheaper and faster.
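
The difference between a query by key and a query by value can be sketched as follows. This is a minimal illustration, not how any particular CFDB places data: the node names are hypothetical, and simple hash-modulo placement stands in for whatever partitioning scheme a real system uses.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster


def node_for_key(row_key, nodes=NODES):
    """Map a row key to exactly one node using a stable hash."""
    digest = hashlib.sha256(row_key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def nodes_for_key_query(row_key, nodes=NODES):
    """A lookup by key only needs the single node that owns the key."""
    return [node_for_key(row_key, nodes)]


def nodes_for_value_query(nodes=NODES):
    """Without a global index, a query by column value must visit every node."""
    return list(nodes)
```

With four nodes the gap is small, but the number of nodes a value query touches grows with the cluster, while a key query always touches one.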

Model Column Families Around Query Patterns

As discussed earlier, NoSQL data modeling is always based on query patterns; however,

it is also important to understand the business context behind the objects of interest:

hence, start your design with entities and relationships, if you can. Unlike in relational

databases, it's not easy to tune or introduce new query patterns by simply creating

secondary indexes or building complex SQL queries (using joins, order by, group by) because

of a CFDB's high-scale distributed nature. So think about query patterns up front, and design

column families accordingly.
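
To make this concrete, here is a small sketch of a column family shaped around one assumed query pattern, "show a customer's orders, newest first". The query, the family name `orders_by_customer`, and the sample data are all hypothetical; the point is only that the row key and column names are chosen so the query becomes a single lookup by key.

```python
# Hypothetical query pattern: "show a customer's orders, newest first".
# Row key = customer id; column name = (order date, order id); value = summary.
orders_by_customer = {}


def record_order(customer_id, order_date, order_id, summary):
    """Write the order into the row that the query will later read by key."""
    row = orders_by_customer.setdefault(customer_id, {})
    row[(order_date, order_id)] = summary


def recent_orders(customer_id, limit=10):
    """Answer the query with one key lookup; no join or secondary index."""
    row = orders_by_customer.get(customer_id, {})
    # ISO-formatted dates sort correctly as plain strings.
    newest_first = sorted(row.items(), key=lambda item: item[0], reverse=True)
    return [summary for _, summary in newest_first[:limit]]


record_order("cust-42", "2013-05-01", "o-1", "2 books")
record_order("cust-42", "2013-06-15", "o-2", "1 laptop")
record_order("cust-7", "2013-06-01", "o-3", "3 pens")
```

A different query, such as finding every customer who bought a given product, would need its own column family keyed by product, which is why the query patterns have to be known before the families are designed.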

Entities and their relationships still matter (unless the use case is special, perhaps

storing logs or other time series data). What if you are given query patterns to create a

data model for an e-commerce website, but you were not told anything about the entities

and relationships? You might try to figure out entities and relationships, knowingly or
