problem of having lots of different time series and can work reasonably well up to levels of
hundreds of millions or billions of data points. As we saw in Chapter 1 , however, even 19th
century shipping data produced roughly a billion data points. As of 2014, the NASDAQ
stock exchange handles a billion trades in just over three months. Recording the operating
conditions on a moderate-sized cluster of computers can produce half a billion data points in
a day.
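A quick back-of-the-envelope calculation shows how easily a monitoring workload reaches that scale. The cluster size, metric count, and sampling interval below are illustrative assumptions, not figures from the text:

```python
# Rough check of the data rates mentioned above; all inputs are
# illustrative assumptions chosen to land near half a billion/day.
machines = 1_000           # a moderate-sized cluster
metrics_per_machine = 100  # CPU, memory, disk, network counters, ...
sample_interval_s = 15     # one sample per metric every 15 seconds

samples_per_day = 24 * 60 * 60 // sample_interval_s  # 5,760 samples
points_per_day = machines * metrics_per_machine * samples_per_day
print(points_per_day)  # 576,000,000 -- roughly half a billion
```

Even modest per-machine instrumentation, in other words, multiplies out to hundreds of millions of points per day.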
Moreover, simply storing the data is one thing; retrieving and processing it is quite another.
Modern applications such as machine learning systems, or even status displays, may need to
retrieve and process a million or more data points per second.
While relational systems can scale into the lower end of these size and speed ranges, the
costs and complexity involved grow very fast. As data scales continue to grow, a larger and
larger percentage of time series applications simply don't fit well into relational databases.
Using the star schema but changing to a NoSQL database doesn't particularly help, either,
because the core of the problem is in the use of a star schema in the first place, not just the
amount of data.
NoSQL Database with Wide Tables
The core problem with the star schema approach is that it uses one row per measurement.
One technique for increasing the rate at which data can be retrieved from a time series database is to store many values in each row. In some NoSQL databases, such as Apache HBase or MapR-DB, the number of columns in a table is nearly unbounded as long as the number of columns with active data in any particular row is kept to a few hundred thousand. This capability can be exploited to store multiple values per row. Doing this allows
data points to be retrieved at a higher speed because the maximum rate at which data can be
scanned is partially dependent on the number of rows scanned, partially on the total number
of values retrieved, and partially on the total volume of data retrieved. By decreasing the
number of rows, that part of the retrieval overhead is substantially cut down, and retrieval
rate is increased. Figure 3-3 shows one way of using wide tables to decrease the number of
rows used to store time series data. This technique is similar to the default table structure
used in OpenTSDB, an open source database that will be described in more detail in
Chapter 4 . Note that such a table design is very different from one that you might expect to
use in a system that requires a detailed schema be defined ahead of time. For one thing, the
number of possible columns is absurdly large if you need to actually write down the schema.
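The wide-table idea can be sketched in a few lines. The row-key layout and names below are illustrative only (OpenTSDB's actual on-disk format is a compact byte-level encoding, described in Chapter 4): each row holds one metric for one hour, and the column qualifier is the offset in seconds into that hour.

```python
# Sketch of a wide-table layout: one row per metric per hour,
# one column per time offset within the hour. The row-key scheme
# here is an illustrative assumption, not OpenTSDB's real format.

def wide_row(metric, timestamp, value, table):
    """Place one measurement into the wide-table layout.

    The row key combines the metric name with the start of its hour;
    the column qualifier is the offset (seconds) into that hour.
    """
    hour = int(timestamp) - int(timestamp) % 3600
    row_key = f"{metric}:{hour}"
    offset = int(timestamp) - hour
    table.setdefault(row_key, {})[offset] = value
    return row_key

table = {}                 # stands in for an HBase/MapR-DB table
base = 1_440_000_000       # an hour-aligned epoch second, for clarity
for i in range(5):
    wide_row("cpu.load", base + i * 600, 0.5 + i / 10, table)

# Five measurements spanning forty minutes collapse into one row
# with five columns, so a scan touches one row instead of five.
print(len(table))  # 1
```

Scanning such a table reads the same number of values but touches far fewer rows, which is exactly where the per-row overhead described above is saved.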