Data Modeling - Practical Cassandra


SELECT * FROM events;

 id | time           | event_type | data
----+----------------+------------+-----------------
  1 | 16:41:33.90814 |          4 | some event data
  2 | 16:59:48.9131  |          2 | some event data
  3 | 17:12:12.12758 |          4 | some event data
  4 | 17:32:17.83765 |          1 | some event data
  5 | 17:48:57.10934 |          0 | some event data

To get around the lack of joins, we can simply store the event_type value in
the column every time. This denormalization of data is not frowned upon when
using Cassandra. Modeling for Cassandra should follow the idea that disk space
is cheap, so duplicating data (even multiple times) should not be an issue. In
fact, normalizing data is an anti-pattern in Cassandra. The Cassandra model
would be similar to the relational one; however, a few key differences have a
large effect on performance and usability.

Listing 3.4 shows an exact copy of the relational version, except that it
stores each event type as the value itself rather than as the ID of a relation.
The primary concern with this model is that each row holds a single event. This
makes it very difficult to find events that belong to a particular time or
event type. In practice, you have to know the ID of an event to get its
information without doing a full scan of the ColumnFamily.

Listing 3.4 Example of Cassandra Data Model for Log Storage (Copy of RDBMS)

CREATE TABLE events (
    id UUID PRIMARY KEY,
    time TIMESTAMP,
    event_type TEXT,
    data TEXT
);
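To make the limitation of Listing 3.4 concrete, here is a minimal sketch of the two kinds of query this table invites. The UUID and the 'login' event type are made-up values for illustration only; they do not come from the example data.

```cql
-- Lookup by primary key: Cassandra can go straight to the row.
-- (The UUID below is a placeholder value.)
SELECT * FROM events
 WHERE id = 123e4567-e89b-12d3-a456-426614174000;

-- Restricting on a non-key column is rejected unless ALLOW FILTERING
-- is added, and with it Cassandra has to scan rows to find matches.
-- ('login' is a hypothetical event type.)
SELECT * FROM events
 WHERE event_type = 'login'
 ALLOW FILTERING;
```

The first query is the only access pattern this model serves well; the second works, but its cost grows with the size of the table rather than with the number of matching events.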

We can solve some of the issues here with indexes; however, the queries would
not be as performant as they could be. Let's say you would like to get all
events for a particular hour. We can easily add an index to the time field, but
querying it will cause excessive load, as every matching event will need to be
pulled from different rows scattered about the cluster.
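As a rough sketch of the index approach just described, the statements below add a secondary index on the time column and query through it. The index name and the timestamp are assumptions chosen for illustration.

```cql
-- Secondary index on the time column of the events table.
-- (events_time_idx is an illustrative name.)
CREATE INDEX events_time_idx ON events (time);

-- An equality restriction on the indexed column can be served by the
-- index, but the matching rows may live on many different nodes.
-- (The timestamp is a placeholder value.)
SELECT * FROM events
 WHERE time = '2013-06-01 16:00:00+0000';
```

Note that this only matches a single exact timestamp. Whether a range restriction covering a whole hour is accepted on an indexed column depends on the Cassandra version; it may be rejected outright or require ALLOW FILTERING, which is one more sign that this model does not fit the query.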
