Time Series Data - HBase Design Patterns


The lesson here is to use composite keys: put the UID at the beginning to take care of load balancing, put the timestamp in the key, and add additional information at the end. Note that the ordering of the fields inside the composite key reflects the types of queries we anticipate.
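The key layout described above can be sketched as follows. This is a minimal illustration, not OpenTSDB's actual code: the function name make_row_key, the 3-byte UID width, and the sample UID values are assumptions made for the example.

```python
import struct

def make_row_key(metric_uid: bytes, base_ts: int, tags: dict) -> bytes:
    """Build a composite key: metric UID, then 4-byte timestamp, then tag pairs."""
    # ">I" packs an unsigned 32-bit big-endian integer, so keys sort by time.
    key = metric_uid + struct.pack(">I", base_ts)
    # Sort the tag pairs so the same tag set always produces the same key.
    for tagk_uid, tagv_uid in sorted(tags.items()):
        key += tagk_uid + tagv_uid
    return key

# Hypothetical 3-byte UIDs; 1357027200 is 2013-01-01 08:00:00 UTC.
key = make_row_key(b"\x00\x00\x01", 1357027200, {b"\x00\x00\x01": b"\x00\x00\x02"})
print(len(key), key.hex())
```

The fields are fixed-width, so a reader can slice the key back into its parts without any delimiter.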

The timestamp

The timestamp is a Unix epoch value in seconds, encoded on 4 bytes. Rows are broken up into hour increments, reflected by the timestamp in each row. Thus, each timestamp is normalized to an hour boundary, for example, 2013-01-01 08:00:00. This avoids stuffing too many data points into a single row, as that would affect region distribution. However, note that a row can still hold a large number of data points if the frequency of data generation is high.
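The hour normalization and the 4-byte encoding can be sketched as follows. The helper names base_hour and encode_ts are hypothetical, and the sample time is chosen only to match the example above.

```python
import struct
from datetime import datetime, timezone

SECONDS_PER_HOUR = 3600

def base_hour(ts: int) -> int:
    """Round a Unix timestamp (seconds) down to the start of its hour."""
    return ts - (ts % SECONDS_PER_HOUR)

def encode_ts(ts: int) -> bytes:
    """Encode the timestamp on 4 bytes, big-endian, so byte order matches time order."""
    return struct.pack(">I", ts)

# A data point recorded at 08:42:17 UTC lands in the 08:00:00 row.
ts = int(datetime(2013, 1, 1, 8, 42, 17, tzinfo=timezone.utc).timestamp())
base = base_hour(ts)
print(datetime.fromtimestamp(base, tz=timezone.utc).isoformat())
print(encode_ts(base).hex(), "offset within the hour:", ts - base)
```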

Also, because HBase sorts data by row key, the rows for the same metric and time bucket, but with different tags, are grouped together for efficient queries. This assumes that the number of tags is small, and indeed OpenTSDB limits it to eight tags.
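The grouping effect can be checked with a small sketch. Python compares bytes objects lexicographically by unsigned byte value, which is used here as a stand-in for HBase's row key ordering. The row_key helper and all UID values are hypothetical.

```python
import struct

def row_key(metric_uid: bytes, base_ts: int, tag_pairs) -> bytes:
    """Metric UID, 4-byte base hour, then (tag key UID, tag value UID) pairs."""
    return metric_uid + struct.pack(">I", base_ts) + b"".join(k + v for k, v in tag_pairs)

CPU, MEM = b"\x00\x00\x01", b"\x00\x00\x02"   # hypothetical metric UIDs
HOST = b"\x00\x00\x01"                        # hypothetical tag key UID
H8, H9 = 1357027200, 1357030800               # 08:00 and 09:00 UTC on 2013-01-01

keys = [
    row_key(CPU, H9, [(HOST, b"\x00\x00\x0a")]),
    row_key(MEM, H8, [(HOST, b"\x00\x00\x0a")]),
    row_key(CPU, H8, [(HOST, b"\x00\x00\x0b")]),
    row_key(CPU, H8, [(HOST, b"\x00\x00\x0a")]),
]

ordered = sorted(keys)
# A prefix scan on metric + hour finds all tag combinations as one contiguous run.
prefix = CPU + struct.pack(">I", H8)
scan = [k for k in ordered if k.startswith(prefix)]
print([k.hex() for k in scan])
```

Both rows for the CPU metric in the 08:00 bucket end up adjacent at the start of the sorted list, regardless of the order in which they were written.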

When storing time series data, implement the following best practices:

• Store a reasonable time interval per row. The amount of data should not make the table too tall and narrow or too short and wide. One hour was chosen here.

• Use tags to store the time interval designation.

• Use your own data encoding, since we deal with binary data here.

• Take advantage of the natural sorting of columns in the row.

• Design for efficient access.
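As a sketch of the encoding and column-sorting practices above, the following uses a simplified, hypothetical cell layout: the column qualifier is the 2-byte big-endian offset in seconds from the row's base hour, and the value is an 8-byte double. This is an illustration under those assumptions, not OpenTSDB's exact qualifier format.

```python
import struct

def encode_cell(ts: int, value: float):
    """Return (base hour, qualifier, value bytes) for one data point."""
    base = ts - (ts % 3600)
    # The offset is between 0 and 3599, so it fits in an unsigned 16-bit integer.
    qualifier = struct.pack(">H", ts - base)
    return base, qualifier, struct.pack(">d", value)

# Data points arrive out of order within the same hour.
row = {}
for ts, v in [(1357027200 + 90, 0.7), (1357027200 + 5, 0.4), (1357027200 + 3599, 0.9)]:
    base, qualifier, encoded = encode_cell(ts, v)
    row[qualifier] = encoded

# Sorting the qualifiers as bytes yields the points in time order.
offsets = [struct.unpack(">H", q)[0] for q in sorted(row)]
print(offsets)
```

Because the qualifier is big-endian, the store's byte ordering of columns matches chronological order, so a reader gets the points in time order without sorting them again.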

Compactions

Why is compaction required? The answer is to reduce storage, because the row key is stored again with every column. If compactions have been enabled for a TSD (an OpenTSDB time series daemon), a row might be compacted after its base hour has passed or after a query has run over the row. The lesson here is to keep compactions, both minor and major, in mind in your design, because they affect performance.
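The storage argument can be made concrete with a rough estimate. The sketch below assumes a deliberately simplified cost model in which each stored cell costs only row key + qualifier + value bytes (a real HBase cell also carries the column family, a timestamp, and length fields). The compact function shows one plausible merge, concatenating qualifiers and values into a single column. The helper names and sizes are assumptions for illustration.

```python
import struct

def stored_bytes(row_key: bytes, cells: dict) -> int:
    """Simplified cost model: every cell repeats the full row key."""
    return sum(len(row_key) + len(q) + len(v) for q, v in cells.items())

def compact(cells: dict):
    """Merge all cells into one: concatenated qualifiers, concatenated values."""
    ordered = sorted(cells.items())
    return b"".join(q for q, _ in ordered), b"".join(v for _, v in ordered)

ROW_KEY = bytes(13)  # placeholder 13-byte composite key
# One data point per second for an hour: 2-byte qualifier, 8-byte value each.
cells = {struct.pack(">H", off): struct.pack(">d", 0.0) for off in range(3600)}

before = stored_bytes(ROW_KEY, cells)
merged_qualifier, merged_value = compact(cells)
after = stored_bytes(ROW_KEY, {merged_qualifier: merged_value})
print(before, after)
```

Under this model the row key is stored once instead of 3600 times, which is where the saving comes from. The qualifier and value bytes themselves are unchanged.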
