problem of having lots of different time series and can work reasonably well up to levels of
hundreds of millions or billions of data points. As we saw in Chapter 1 , however, even 19th
century shipping data produced roughly a billion data points. As of 2014, the NASDAQ
stock exchange handles a billion trades in just over three months. Recording the operating
conditions on a moderate-sized cluster of computers can produce half a billion data points in
a day.
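A quick back-of-the-envelope calculation shows how easily a monitoring workload reaches that scale. The cluster size, metric count, and sampling interval below are illustrative assumptions, not figures from the text:

```python
# Rough check of the data rates mentioned above; all inputs are
# illustrative assumptions chosen to land near half a billion/day.
machines = 1_000           # a moderate-sized cluster
metrics_per_machine = 100  # CPU, memory, disk, network counters, ...
sample_interval_s = 15     # one sample per metric every 15 seconds

samples_per_day = 24 * 60 * 60 // sample_interval_s  # 5,760 samples
points_per_day = machines * metrics_per_machine * samples_per_day
print(points_per_day)  # 576,000,000 -- roughly half a billion
```

Even modest per-machine instrumentation, in other words, multiplies out to hundreds of millions of points per day.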
Moreover, simply storing the data is one thing; retrieving and processing it is quite another.
Modern applications such as machine learning systems, or even status displays, may need to
retrieve and process a million or more data points per second.
While relational systems can scale into the lower end of these size and speed ranges, the
costs and complexity involved grow very fast. As data scales continue to grow, a larger and
larger percentage of time series applications simply don't fit well into relational databases.
Using the star schema but changing to a NoSQL database doesn't particularly help, either,
because the core of the problem is in the use of a star schema in the first place, not just the
amount of data.
NoSQL Database with Wide Tables
The core problem with the star schema approach is that it uses one row per measurement.
One technique for increasing the rate at which data can be retrieved from a time series database is to store many values in each row. In some NoSQL databases, such as Apache HBase or MapR-DB, the number of columns in a table is nearly unbounded as long as the number of columns with active data in any particular row is kept to a few hundred thousand. This capability can be exploited to store multiple values per row. Doing this allows
data points to be retrieved at a higher speed because the maximum rate at which data can be
scanned is partially dependent on the number of rows scanned, partially on the total number
of values retrieved, and partially on the total volume of data retrieved. By decreasing the
number of rows, that part of the retrieval overhead is substantially cut down, and retrieval
rate is increased. Figure 3-3 shows one way of using wide tables to decrease the number of
rows used to store time series data. This technique is similar to the default table structure
used in OpenTSDB, an open source database that will be described in more detail in
Chapter 4 . Note that such a table design is very different from one that you might expect to
use in a system that requires a detailed schema be defined ahead of time. For one thing, the
number of possible columns is absurdly large if you need to actually write down the schema.
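The wide-table idea can be sketched in a few lines. The row-key layout and names below are illustrative only (OpenTSDB's actual on-disk format is a compact byte-level encoding, described in Chapter 4): each row holds one metric for one hour, and the column qualifier is the offset in seconds into that hour.

```python
# Sketch of a wide-table layout: one row per metric per hour,
# one column per time offset within the hour. The row-key scheme
# here is an illustrative assumption, not OpenTSDB's real format.

def wide_row(metric, timestamp, value, table):
    """Place one measurement into the wide-table layout.

    The row key combines the metric name with the start of its hour;
    the column qualifier is the offset (seconds) into that hour.
    """
    hour = int(timestamp) - int(timestamp) % 3600
    row_key = f"{metric}:{hour}"
    offset = int(timestamp) - hour
    table.setdefault(row_key, {})[offset] = value
    return row_key

table = {}                 # stands in for an HBase/MapR-DB table
base = 1_440_000_000       # an hour-aligned epoch second, for clarity
for i in range(5):
    wide_row("cpu.load", base + i * 600, 0.5 + i / 10, table)

# Five measurements spanning forty minutes collapse into one row
# with five columns, so a scan touches one row instead of five.
print(len(table))  # 1
```

Scanning such a table reads the same number of values but touches far fewer rows, which is exactly where the per-row overhead described above is saved.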