SELECT * FROM events;
id | time           | event_type | data
---+----------------+------------+----------------
1  | 16:41:33.90814 | 4          | some event data
2  | 16:59:48.9131  | 2          | some event data
3  | 17:12:12.12758 | 4          | some event data
4  | 17:32:17.83765 | 1          | some event data
5  | 17:48:57.10934 | 0          | some event data
To get around the lack of joins, we can simply store the event_type value in
the column every time. This denormalization of data is not frowned upon
when using Cassandra. Modeling for Cassandra should follow the idea that disk
space is cheap, so duplicating data (even multiple times) should not be an issue. In
fact, “normalization” of data is an anti-pattern in Cassandra. The Cassandra model
would be similar to the relational one; however, a few key differences have a large
effect on performance and usability.
Listing 3.4 shows a copy of the relational version, except that it stores each event
type as a value rather than as the ID of a related row. The primary concern with this
model is that each row holds a single event. This makes it very difficult to find events
that belong to a particular time or event type. Basically, you have to know the ID
of the event to get its information without doing a full scan of the ColumnFamily.
Listing 3.4 Example of Cassandra Data Model for Log Storage (Copy of RDBMS)
CREATE TABLE events (
    id UUID PRIMARY KEY,
    time TIMESTAMP,
    event_type TEXT,
    data TEXT
);
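As a rough illustration of the limitation described above, the following statements sketch how the table in Listing 3.4 might be written to and read. The UUID, the timestamp, and the event type name 'login' are made-up values for illustration only; they do not come from the sample data.

```sql
-- Hypothetical row: the UUID, timestamp, and 'login' type are illustrative values.
INSERT INTO events (id, time, event_type, data)
VALUES (123e4567-e89b-12d3-a456-426655440000,
        '2013-06-01 16:41:33+0000',
        'login',
        'some event data');

-- Works well: a lookup by primary key is routed to the replicas that own the row.
SELECT * FROM events
WHERE id = 123e4567-e89b-12d3-a456-426655440000;

-- Rejected as written: event_type is neither part of the key nor indexed,
-- so answering this would require scanning every row.
-- SELECT * FROM events WHERE event_type = 'login';
```

The event type is stored as text in every row, which is the denormalization discussed earlier; the cost is that only the id can be used to locate a row efficiently.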
We can solve some of the issues here with indexes; however, queries would
not perform as well as they could. Let's say you would like to get all events for a
particular hour. We can easily add an index to the time field, but this will cause
excessive load, as every event will need to be pulled from different rows scattered
about the cluster.
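A minimal sketch of the index approach might look like the following. The index name events_time_idx and the timestamp are assumptions chosen for illustration. Note also that a secondary index of this kind serves equality lookups on the indexed column, so collecting a full hour of events would likely still involve filtering or many separate lookups.

```sql
-- Hypothetical index name; creates a secondary index on the time column.
CREATE INDEX events_time_idx ON events (time);

-- Equality lookup through the index. The matching rows may live on
-- many different nodes, so the coordinator may have to contact much of the cluster.
SELECT * FROM events
WHERE time = '2013-06-01 16:41:33+0000';
```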
 