Effective CQL - Mastering Apache Cassandra

Database Reference

In-Depth Information

The index table will look like this:

Baghdad | 5 |

-------------------+-----------------------------------------

Cape Town | 1 | 4 |

-------------------+-----------------------------------------

Kabul | 2 | 6 |

-------------------+-----------------------------------------

Mogadishu | 3 |

-------------------+-----------------------------------------

So, indexes are basically wide-rows or dynamic rows with row keys as values in the

column being indexed (remember wide rows from the previous section?). These indexes

are stored locally, which means an index table on node X will have an index created only

for the rows that exist on node X. This means, every search that includes an indexed

column makes a query on the index table on all the nodes, but do not worry about it, it is

fast. The other things that you should keep in mind are as follows:

• Avoid high cardinality columns : Cardinality is the number of unique values. In

our user's case, we have 40,000 or fewer cities across the world. If you have 40

million users, you have 1,000 users per city on average. If you have 25 nodes, you

are making 25 reads (one on each node) from the index table, and then 1,000

reads for 1,000 rows. This looks okay; 1,025 reads for 1,000 records is about one

read per record. Cities compared to users have low cardinality, so this helps us to

optimize a read request. What happens when we use something that has high car-

dinality? Let's say we use the user's phone number as the index. Each user has a

unique phone number. So, it is likely that when we search by phone number, we

will get at most one record. The read process makes 25 read requests to the nodes

to look into their local indexes; then if it exists, it will make one read request to

the node that has the row. So, we are basically making 26 read requests read one

row.

We have seen that high cardinality columns are not good for indexes and hence,

we should avoid them. It may be acceptable if it is a very low volume query (ac-

cessed once in a while), but it may be wise to create another table manually that

has a phone number as the row key (primary key) and one of the columns as

user_id from the users table. This way, we will make two read request fetch

one record. The downside of this approach is, it is an extra effort on the developer

side.

Search WWH ::

Custom Search

Home