Storing and Processing Time Series Data - Time Series Databases

an effective and simple, modern format that can store the time and a number of optional values.
Figure 3-1 shows two possible Parquet schemas for recording time series. The schema on the left
is suitable for special-purpose storage of time series data where you know what measurements are
plausible. In the example on the left, only the four time series that are explicitly shown can be
stored (tempIn, pressureIn, tempOut, pressureOut). Adding another time series would require
changing the schema. The more abstract Parquet schema on the right in Figure 3-1 is much better
for cases where you may want to embed more metadata about the time series into the data file
itself. Also, there is no a priori limit on the number or names of different time series that can
be stored in this format. The format on the right would be much more appropriate if you were
building a time series library for use by other people.

Figure 3-1. Two possible schemas for storing time series data in Parquet. The schema on the left
embeds knowledge about the problem domain in the names of values. Only the four time series shown
can be stored without changing the schema. In contrast, the schema on the right is more flexible;
you could add additional time series. It is also a bit more abstract, grouping many samples for a
single time series into a single block.
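
Since the figure itself is not reproduced here, the difference between the two layouts can be sketched in plain Python. This is only an illustration under stated assumptions, not the Parquet schemas from the figure: the four value names come from the text, but the class names, the integer timestamp, and every field of the block-style record (name, metadata, times, values) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FixedSample:
    """Left-hand style: one row per timestamp, one named optional field per series."""
    time: int  # assumption: an integer timestamp such as epoch milliseconds
    tempIn: Optional[float] = None
    pressureIn: Optional[float] = None
    tempOut: Optional[float] = None
    pressureOut: Optional[float] = None


@dataclass
class SeriesBlock:
    """Right-hand style: many samples of one series grouped into one block.

    All field names are hypothetical stand-ins for the figure's schema.
    """
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    times: List[int] = field(default_factory=list)
    values: List[float] = field(default_factory=list)

    def append(self, t: int, v: float) -> None:
        self.times.append(t)
        self.values.append(v)


def to_blocks(rows: List[FixedSample]) -> List[SeriesBlock]:
    """Regroup fixed-schema rows into one block per series, skipping missing values."""
    names = ["tempIn", "pressureIn", "tempOut", "pressureOut"]
    blocks = {n: SeriesBlock(name=n) for n in names}
    for row in rows:
        for n in names:
            value = getattr(row, n)
            if value is not None:
                blocks[n].append(row.time, value)
    return [blocks[n] for n in names]
```

In the fixed layout, recording a fifth measurement means adding a field to FixedSample. In the block layout it only means creating another SeriesBlock with a new name, which is the flexibility the caption describes.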

Such a simple implementation of a time series, especially if you use a file format like Parquet,
can be remarkably serviceable as long as the number of time series being analyzed is relatively
small and as long as the time ranges of interest are large with respect to the partitioning time
for the flat files holding the data.
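
The relationship between query range and partitioning time can be made concrete with a small calculation. The sketch below is a simplified model, not a description of any particular system: it assumes files are cut on fixed boundaries of partition_seconds measured from time zero, that timestamps are integer seconds, that samples are spread evenly in time, and that every file overlapping the query is read in full. The function names are hypothetical.

```python
from typing import List


def partitions_for_range(start: int, end: int, partition_seconds: int) -> List[int]:
    """Return the start times of all partitions overlapping the half-open range [start, end)."""
    if partition_seconds <= 0:
        raise ValueError("partition_seconds must be positive")
    if end <= start:
        raise ValueError("end must be greater than start")
    # Round the query start down to the partition boundary that contains it.
    first = (start // partition_seconds) * partition_seconds
    return list(range(first, end, partition_seconds))


def time_usable_fraction(start: int, end: int, partition_seconds: int) -> float:
    """Fraction of the time span read from disk that falls inside the query range."""
    files = partitions_for_range(start, end, partition_seconds)
    return (end - start) / (len(files) * partition_seconds)
```

With daily files (86400 seconds), a query covering a full week aligned to day boundaries touches seven files and uses all of what it reads, while a ten-minute query touches one file and uses 600/86400 of it, which is under one percent.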

While it is fairly common for systems to start out with a flat file implementation, it is also
common for the system to outgrow such a simple implementation before long. The basic problem is
that as the number of time series in a single file increases, the fraction of usable data for any
particular query decreases, because most of the data being read belongs to other time series.
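
A toy scan shows how quickly this fraction falls. The sketch assumes a deliberately naive layout in which a file is a flat sequence of (series name, timestamp, value) records, every series has the same number of samples, and a query must read every record to find the ones it wants. The record layout and function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, int, float]  # (series name, timestamp, value)


def make_file(n_series: int, samples_per_series: int) -> List[Record]:
    """Build an in-memory stand-in for a flat file with interleaved series."""
    return [
        (f"series{s}", t, float(t))
        for t in range(samples_per_series)
        for s in range(n_series)
    ]


def scan_for_series(records: Iterable[Record], wanted: str) -> Tuple[List[Record], float]:
    """Read every record, keep those of one series, and report the usable fraction."""
    read = 0
    kept: List[Record] = []
    for rec in records:
        read += 1
        if rec[0] == wanted:
            kept.append(rec)
    fraction = len(kept) / read if read else 0.0
    return kept, fraction
```

Under these assumptions the usable fraction for a single-series query is 1/N for a file holding N series: a quarter of the records with four series, but only one percent with a hundred.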

Likewise, when the partition time is long with respect to the average query, the fraction of
usable data decreases again since most of the data in a file is outside the time range of
interest. Efforts to remedy these problems typically lead to other problems. Using lots of files
