full programming language. The first-class presence of Scala in Spark SQL programs makes
it much easier to manipulate time series data from OpenTSDB.
In particular, with Spark SQL you can use an RDD (resilient distributed dataset) directly as
an input, and that RDD can be populated by any convenient method. That means that you
can use a range scan to read a number of rows from the OpenTSDB data tier in either HBase
or MapR-DB into memory in the form of an RDD. This leaves you with an RDD whose
keys are row keys and whose values are HBase Result structures. The getFamilyMap method
can then be used to get all columns and cell values for each row. These, in turn, can be
emitted as tuples that contain metric, timestamp, and value. The flatMap method is useful here
because it allows each data row to be transformed into multiple time series samples.
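A minimal sketch of that pipeline might look like the following. This is not OpenTSDB's own code: the table name, the 3-byte metric UID prefix, the unsalted row key, the 2-byte qualifier layout (12 bits of seconds offset plus 4 flag bits), the 8-byte integer cell values, and the metricName helper are all assumptions made for illustration, and a real reader would also have to handle salted keys, compacted columns, and float values.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConverters._

object TsdbScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tsdb-scan"))

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "tsdb") // assumed OpenTSDB data table name
    // A row-key range scan can be set here via TableInputFormat.SCAN_ROW_START
    // and TableInputFormat.SCAN_ROW_STOP to read only the rows of interest.

    // RDD of (row key, Result): one entry per HBase row read by the scan
    val rows = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // One OpenTSDB row holds up to an hour of samples, so flatMap turns each
    // row into many (metric, timestamp, value) tuples.
    val samples = rows.flatMap { case (key, result) =>
      val rowKey = key.get()
      val baseTime = Bytes.toInt(rowKey, 3)   // assumes 3-byte metric UID, no salt
      val metric = metricName(rowKey)          // hypothetical UID-to-name lookup
      result.getFamilyMap(Bytes.toBytes("t")).asScala.map { case (qualifier, value) =>
        // assumed 2-byte qualifier: high 12 bits = seconds offset, low 4 = flags
        val offset = (Bytes.toShort(qualifier) >> 4) & 0x0FFF
        (metric, baseTime.toLong + offset, Bytes.toLong(value)) // assumes long values
      }
    }
  }

  // Placeholder: resolving a metric UID to its name requires the tsdb-uid table.
  def metricName(rowKey: Array[Byte]): String = ???
}
```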
You can then use any SQL query that you like directly on these tuples as stored in the
resulting RDD. Because all processing after reading rows from HBase is done in memory and in
parallel, the processing speed will likely be dominated by the cost of reading the rows of data
from the data tier. Furthermore, in Spark, you aren't limited by language. If SQL isn't
convenient, you can do any kind of Spark computation just as easily.
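As a sketch of that last step, the decoded tuples can be turned into a DataFrame and queried with ordinary SQL. The names here (an existing SparkSession called spark, an RDD of (metric, timestamp, value) tuples called samples, and the per-minute aggregation itself) are assumptions for illustration, not part of OpenTSDB or Spark:

```scala
// assumes `spark: SparkSession` and `samples: RDD[(String, Long, Long)]`
// produced by the row-decoding step described above
import spark.implicits._

val df = samples.toDF("metric", "ts", "value")
df.createOrReplaceTempView("samples")

// Any SQL now runs in memory, in parallel, over the decoded samples;
// here, a per-metric mean over one-minute windows.
spark.sql("""
  SELECT metric,
         floor(ts / 60) * 60 AS minute,
         avg(value)          AS mean_value
  FROM samples
  GROUP BY metric, floor(ts / 60)
""").show()
```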
A particularly nice feature of using Spark to analyze metric data is that the framework
already handles most of what you need to do in a fairly natural way. You need to write code
to transform OpenTSDB rows into samples, but this is fairly straightforward compared to
actually extending the platform by writing an input format or data storage module from scratch.
Why Not Apache Hive?
Analyzing time series data from OpenTSDB using Hive is much more difficult than it is with
Spark. The core of the problem is that the HBase storage engine for Hive requires that data
be stored using a very standard, predefined schema. With OpenTSDB, the names of the
columns actually contain the time portion of the data, and it isn't possible to write a fully
defined schema to describe the tables. Not only are there a large number of possible columns
(more than 10^1000), but the names of the columns are part of the data. Hive doesn't like that,
so this fact has to be hidden from it.
The assumptions of Hive's design are baked in at a pretty basic level in the Hive storage
engine, particularly the assumption that each column in the database represents a
single column in the result. The only way to have Hive understand OpenTSDB would be to clone
and rewrite the entire HBase storage engine that is part of Hive. At that point, each row of
data from the OpenTSDB table could be returned as an array of tuples containing one element
for time and one for value. Each such row could then be exploded using a lateral view join.
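If such a rewritten storage engine existed, the final explosion step would use standard HiveQL. The table and column names below are hypothetical, since they would be defined by that nonexistent storage handler; only the LATERAL VIEW explode construct itself is real Hive syntax:

```sql
-- Hypothetical: assumes a rewritten storage handler exposing each OpenTSDB
-- row as a column `samples` of type array<struct<time:bigint, value:double>>.
SELECT metric, sample.time, sample.value
FROM tsdb_rows
LATERAL VIEW explode(samples) s AS sample;
```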