Cache Coherency Through Restart Logs
Ultimately, it is likely to be desirable to allow multiple TSDs to run at the same time and still use in-memory caching for performance. This, however, leads to a situation where new data points and requests for existing data could go to any TSD. To ensure that all TSDs have consistent views of all data, we need a cache coherency protocol under which any new data accepted by one TSD has a very high likelihood of being present on every other TSD very shortly after it arrives.
In order to do this simply, we require all TSDs to write restart logs that contain a record of all the transactions they have received, as well as a record of exactly when blobs are written to the storage tier. All TSDs can then read the restart logs of all of the other TSDs. This helps in two ways. First, all TSDs, including those recently started, will have very nearly identical memory states. Second, only one TSD will actually write each row to the database. Such a design avoids nearly all coordination at the cost of requiring that all recent data points be kept in multiple TSD memory spaces.
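To make the idea of a restart log concrete, the sketch below shows one way such log records might look. The record types, field names, and JSON-lines layout are assumptions for illustration only; a real TSD would use a compact binary format in its flat files.

    import json
    import time
    from dataclasses import asdict, dataclass

    # Hypothetical record types for a restart log; names are illustrative only.
    @dataclass
    class DataPointRecord:
        kind: str          # "point": a data point received by this TSD
        metric: str
        tags: dict
        timestamp: float
        value: float

    @dataclass
    class WriteStartRecord:
        kind: str          # "write-start": this TSD is about to flush a row
        row_key: str
        update_time: float # last in-memory update time of the row being flushed

    @dataclass
    class WriteFinishRecord:
        kind: str          # "write-finish": the row is now in the storage tier
        row_key: str
        update_time: float # update time that was actually persisted

    def append_record(log_path, record):
        # Append one record to this TSD's restart log as a single line.
        with open(log_path, "a") as log:
            log.write(json.dumps(asdict(record)) + "\n")

    # Example: log an incoming data point and, later, a completed row write.
    append_record("tsd-1.restartlog",
                  DataPointRecord("point", "cpu.load", {"host": "db01"},
                                  time.time(), 0.42))
    append_record("tsd-1.restartlog",
                  WriteFinishRecord("write-finish", "cpu.load|host=db01",
                                    time.time()))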
This design requires that a TSD be able to read restart logs and modify its in-memory representation at the full production data rate, possibly using several hardware threads. Since restart logs are kept in conventional flat files, reading the data in a binary format at high rates is not a problem. Similarly, since the cache is kept in memory, updating it at more than a million updates per second is not a major problem either.
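As a rough illustration of the log-following loop this implies, the sketch below tails another TSD's restart log and applies each record to the local cache. It reuses the record layout from the earlier sketch, and the cache methods it calls (apply_point, delay_write, discard_if_clean) are hypothetical stand-ins for the TSD's in-memory structures.

    import json
    import time

    def follow_restart_log(log_path, cache):
        # Tail another TSD's restart log and apply each record to the local cache.
        with open(log_path) as log:
            while True:
                line = log.readline()
                if not line:
                    time.sleep(0.01)   # wait for the other TSD to append more
                    continue
                record = json.loads(line)
                if record["kind"] == "point":
                    cache.apply_point(record)
                elif record["kind"] == "write-start":
                    cache.delay_write(record["row_key"])
                elif record["kind"] == "write-finish":
                    cache.discard_if_clean(record["row_key"],
                                           record["update_time"])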
The only remaining issue is to arrange for only one TSD to write each row to the database. This can be done by having each TSD pick a random time to wait before writing an idle dirty row back to the database. When a TSD starts the write, it writes a start transaction to its log, and when it completes the write, it writes a finish transaction. When other TSDs read the finish transaction from the first TSD's restart log, they silently discard their copy of the dirty row if its last update time matches the update time that was written to the database. Any TSD that reads a start transaction delays its own write time for that row by a few seconds to allow the finish record to arrive. By making the range of random wait times large relative to the time required to propagate the start transaction, the probability that two TSDs start a write on the same row can be made very small. Even if two TSDs do write the same row to the database, row updates are atomic and both processes write the same data (since the row is idle at that point). The net effect is that each row is almost always written to the database exactly once.
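A minimal sketch of that write-back protocol appears below, assuming the log records from the earlier sketches. The class and method names, the backoff range, and the few-second delay are illustrative assumptions; only the logic (a random wait before flushing, start and finish records bracketing the write, and discarding a dirty row when the persisted update time matches) follows the description above.

    import random
    import time

    # Assumed constants; in practice the backoff range would be tuned to be
    # large relative to the time needed to propagate a start transaction.
    BACKOFF_RANGE = (30.0, 120.0)   # seconds before flushing an idle dirty row
    START_SEEN_DELAY = 5.0          # extra delay after another TSD announces a write

    class DirtyRow:
        def __init__(self, key, update_time):
            self.key = key
            self.update_time = update_time  # time of the last in-memory update
            self.flush_at = time.time() + random.uniform(*BACKOFF_RANGE)

    class WriteBackScheduler:
        # Decides when this TSD flushes dirty rows and reacts to records read
        # from other TSDs' restart logs. The log and storage objects are
        # assumed to provide simple append() and put() calls.
        def __init__(self, log, storage):
            self.log = log
            self.storage = storage
            self.dirty = {}             # row key -> DirtyRow

        def on_local_update(self, key, row_update_time):
            # Re-randomize the flush time whenever the row changes again.
            self.dirty[key] = DirtyRow(key, row_update_time)

        def on_remote_write_start(self, key):
            # Another TSD announced a write: back off so its finish can arrive.
            if key in self.dirty:
                self.dirty[key].flush_at = time.time() + START_SEEN_DELAY

        def on_remote_write_finish(self, key, persisted_update_time):
            # Silently drop our dirty copy if the persisted state matches ours.
            row = self.dirty.get(key)
            if row is not None and row.update_time == persisted_update_time:
                del self.dirty[key]

        def flush_due_rows(self, rows_in_memory):
            # Write any idle dirty row whose random wait has expired, bracketing
            # the storage write with start and finish records in our own log.
            now = time.time()
            for key, row in list(self.dirty.items()):
                if row.flush_at <= now:
                    self.log.append(("write-start", key, row.update_time))
                    self.storage.put(key, rows_in_memory[key])
                    self.log.append(("write-finish", key, row.update_time))
                    del self.dirty[key]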
With the understanding of the basic concepts behind building a large-scale, NoSQL time series database provided by Chapter 3, and the exploration here in Chapter 4 of open source tools that implement those ideas, you should now be well prepared to tackle your own time series data projects.