an effective, simple, and modern format that can store the time and a number of optional
values. Figure 3-1 shows two possible Parquet schemas for recording time series. The schema
on the left is suitable for special-purpose storage of time series data where you know in
advance which measurements are plausible. In the example on the left, only the four time
series that are explicitly shown can be stored (tempIn, pressureIn, tempOut, pressureOut).
Adding another time series would require changing the schema. The more abstract Parquet
schema on the right in Figure 3-1 is much better for cases where you may want to embed more
metadata about the time series into the data file itself. Also, there is no a priori limit on
the number or names of different time series that can be stored in this format. The format on
the right would be much more appropriate if you were building a time series library for use
by other people.
Figure 3-1. Two possible schemas for storing time series data in Parquet. The schema on the left
embeds knowledge about the problem domain in the names of values. Only the four time series
shown can be stored without changing the schema. In contrast, the schema on the right is more
flexible; you could add additional time series. It is also a bit more abstract, grouping many samples
for a single time series into a single block.
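As a rough illustration of the two layouts in Figure 3-1, the sketch below models them as plain Python dataclasses rather than as real Parquet schemas. The class names WideSample and SeriesBlock, the helper to_blocks, and the fields name, metadata, times, and values are assumptions made for this example; only the four measurement names come from the figure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The four series named in the left-hand schema of Figure 3-1.
KNOWN_SERIES = ("tempIn", "pressureIn", "tempOut", "pressureOut")


@dataclass
class WideSample:
    """Left-hand layout: one row per time, one optional column per known series."""
    time: float
    tempIn: Optional[float] = None
    pressureIn: Optional[float] = None
    tempOut: Optional[float] = None
    pressureOut: Optional[float] = None


@dataclass
class SeriesBlock:
    """Right-hand layout (assumed shape): many samples of one named series."""
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    times: List[float] = field(default_factory=list)
    values: List[float] = field(default_factory=list)

    def append(self, time: float, value: float) -> None:
        self.times.append(time)
        self.values.append(value)


def to_blocks(rows: List[WideSample]) -> List[SeriesBlock]:
    """Convert wide rows into one block per series, skipping missing values."""
    blocks: Dict[str, SeriesBlock] = {}
    for row in rows:
        for name in KNOWN_SERIES:
            value = getattr(row, name)
            if value is None:
                continue
            if name not in blocks:
                blocks[name] = SeriesBlock(name)
            blocks[name].append(row.time, value)
    return list(blocks.values())
```

In this sketch, storing a fifth series would mean adding a field to WideSample, whereas the second layout only needs another SeriesBlock with a new name.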
Such a simple implementation of time series storage, especially if you use a file format like
Parquet, can be remarkably serviceable as long as the number of time series being analyzed is
relatively small and as long as the time ranges of interest are large with respect to the
partitioning time for the flat files holding the data.
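To make the role of the partitioning time concrete, here is a minimal sketch that works out which flat files a query has to open. It assumes one file per fixed-length time window, with windows aligned to multiples of the window length. The function name partitions_for_query and the use of seconds as the unit are assumptions for illustration, not something the text prescribes.

```python
from typing import List, Tuple


def partitions_for_query(query_start: float, query_end: float,
                         partition_seconds: float) -> List[Tuple[float, float]]:
    """Return the [start, end) window of every partition overlapping the query.

    Assumes partitions are aligned to multiples of partition_seconds and that
    the query covers the half-open interval [query_start, query_end).
    """
    if partition_seconds <= 0:
        raise ValueError("partition_seconds must be positive")
    if query_end <= query_start:
        return []
    first = int(query_start // partition_seconds)
    last = int(query_end // partition_seconds)
    # A query ending exactly on a boundary does not touch the next partition.
    if query_end % partition_seconds == 0:
        last -= 1
    return [(i * partition_seconds, (i + 1) * partition_seconds)
            for i in range(first, last + 1)]
```

With hourly partitions, a query spanning several hours reads files that are almost entirely relevant, while a query for a couple of minutes still has to open a full one-hour file.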
While it is fairly common for systems to start out with a flat file implementation, it is also
common for the system to outgrow such a simple implementation before long. The basic
problem is that as the number of time series in a single file increases, the fraction of usable
data for any particular query decreases, because most of the data being read belongs to other
time series.
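The effect can be put in rough numbers. The sketch below assumes that every series in a file contributes the same number of samples and that the whole file must be read to answer a query; both are simplifications made for this example, and usable_fraction is a hypothetical helper name.

```python
def usable_fraction(series_in_file: int, series_wanted: int = 1) -> float:
    """Fraction of a file's samples that a query actually needs.

    Assumes all series in the file hold the same number of samples and that
    the whole file is read regardless of which series the query asks for.
    """
    if series_in_file <= 0:
        raise ValueError("series_in_file must be positive")
    if not 0 <= series_wanted <= series_in_file:
        raise ValueError("series_wanted must be between 0 and series_in_file")
    return series_wanted / series_in_file


if __name__ == "__main__":
    for n in (4, 100, 10_000):
        print(f"{n} series per file: {usable_fraction(n):.4%} of the data read is useful")
```

Under these assumptions, a query for one series out of 100 in a file uses only 1% of what it reads.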
Likewise, when the partition time is long with respect to the average query, the fraction of
usable data decreases again, since most of the data in a file is outside the time range of
interest. Efforts to remedy these problems typically lead to other problems. Using lots of files