full programming language. The first-class presence of Scala in Spark SQL programs makes
it much easier to manipulate time series data from OpenTSDB.
In particular, with Spark SQL you can use an RDD (resilient distributed dataset) directly as
an input, and that RDD can be populated by any convenient method. That means that you
can use a range scan to read a number of rows from the OpenTSDB data tier in either HBase
or MapR-DB into memory in the form of an RDD. This leaves you with an RDD whose
keys are row keys and whose values are HBase Result structures. The getFamilyMap method
can then be used to get all columns and cell values for each row. These, in turn, can be
emitted as tuples that contain metric, timestamp, and value. The flatMap method is useful here
because it allows each data row to be transformed into multiple time series samples.
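A minimal sketch of that pipeline might look like the following. This is not OpenTSDB's own code: the table name, the 3-byte metric UID prefix, the unsalted row key, the 2-byte qualifier layout (12 bits of seconds offset plus 4 flag bits), the 8-byte integer cell values, and the metricName helper are all assumptions made for illustration, and a real reader would also have to handle salted keys, compacted columns, and float values.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConverters._

object TsdbScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tsdb-scan"))

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "tsdb") // assumed OpenTSDB data table name
    // A row-key range scan can be set here via TableInputFormat.SCAN_ROW_START
    // and TableInputFormat.SCAN_ROW_STOP to read only the rows of interest.

    // RDD of (row key, Result): one entry per HBase row read by the scan
    val rows = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // One OpenTSDB row holds up to an hour of samples, so flatMap turns each
    // row into many (metric, timestamp, value) tuples.
    val samples = rows.flatMap { case (key, result) =>
      val rowKey = key.get()
      val baseTime = Bytes.toInt(rowKey, 3)   // assumes 3-byte metric UID, no salt
      val metric = metricName(rowKey)          // hypothetical UID-to-name lookup
      result.getFamilyMap(Bytes.toBytes("t")).asScala.map { case (qualifier, value) =>
        // assumed 2-byte qualifier: high 12 bits = seconds offset, low 4 = flags
        val offset = (Bytes.toShort(qualifier) >> 4) & 0x0FFF
        (metric, baseTime.toLong + offset, Bytes.toLong(value)) // assumes long values
      }
    }
  }

  // Placeholder: resolving a metric UID to its name requires the tsdb-uid table.
  def metricName(rowKey: Array[Byte]): String = ???
}
```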
You can then use any SQL query that you like directly on these tuples as stored in the
resulting RDD. Because all processing after reading rows from HBase is done in memory and in
parallel, the processing speed will likely be dominated by the cost of reading the rows of data
from the data tier. Furthermore, in Spark, you aren't limited by language. If SQL isn't
convenient, you can do any kind of Spark computation just as easily.
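As a sketch of that last step, the decoded tuples can be turned into a DataFrame and queried with ordinary SQL. The names here (an existing SparkSession called spark, an RDD of (metric, timestamp, value) tuples called samples, and the per-minute aggregation itself) are assumptions for illustration, not part of OpenTSDB or Spark:

```scala
// assumes `spark: SparkSession` and `samples: RDD[(String, Long, Long)]`
// produced by the row-decoding step described above
import spark.implicits._

val df = samples.toDF("metric", "ts", "value")
df.createOrReplaceTempView("samples")

// Any SQL now runs in memory, in parallel, over the decoded samples;
// here, a per-metric mean over one-minute windows.
spark.sql("""
  SELECT metric,
         floor(ts / 60) * 60 AS minute,
         avg(value)          AS mean_value
  FROM samples
  GROUP BY metric, floor(ts / 60)
""").show()
```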
A particularly nice feature of using Spark to analyze metric data is that the framework
already handles most of what you need to do in a fairly natural way. You need to write code
to transform OpenTSDB rows into samples, but this is fairly straightforward compared to
actually extending the platform by writing an input format or data storage module from scratch.
Why Not Apache Hive?
Analyzing time series data from OpenTSDB using Hive is much more difficult than it is with
Spark. The core of the problem is that the HBase storage engine for Hive requires that data
be stored using a very standard, predefined schema. With OpenTSDB, the names of the
columns actually contain the time portion of the data, and it isn't possible to write a fully
defined schema to describe the tables. Not only are there a large number of possible columns
(more than 10^1000), but the names of the columns are part of the data. Hive doesn't like that,
so this fact has to be hidden from it.
The assumptions of Hive's design are baked in at a pretty basic level in the Hive storage
engine, particularly the assumption that each column in the database represents a
single column in the result. The only way to have Hive understand OpenTSDB would be to clone
and rewrite the entire HBase storage engine that is part of Hive. At that point, each row of
data from the OpenTSDB table could be returned as an array of tuples containing one element
for time and one for value. Each such row could then be exploded using a lateral view join.
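If such a rewritten storage engine existed, the final explosion step would use standard HiveQL. The table and column names below are hypothetical, since they would be defined by that nonexistent storage handler; only the LATERAL VIEW explode construct itself is real Hive syntax:

```sql
-- Hypothetical: assumes a rewritten storage handler exposing each OpenTSDB
-- row as a column `samples` of type array<struct<time:bigint, value:double>>.
SELECT metric, sample.time, sample.value
FROM tsdb_rows
LATERAL VIEW explode(samples) s AS sample;
```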