(The source of this image is http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/, and the image has been used with due permission.)
Ikai is actually talking about Google's BigTable, the design on which HBase is based. BigTable shards (divides the load) not only when a table outgrows the region size but also when it sees a disproportionately high load on one table. As far as I know, HBase does not have such a provision, which is all the more reason to be careful not to concentrate our keys. So, what should you do?
You can do one of the following things:
Avoid indices if you can: In your case, this means keying the row on something other than time. This is not a perfect approach for well logs, since you need to know when the values were recorded, and keying on the depth of the measurement would give you the same problem.
Randomize your writes: Even pseudorandomization will help you offload some of the servers. Again, in our case, the time ordering is essential information; we would be hard-pressed to do without it. These first two pieces of advice might be good for other situations (such as writing dictionary words in a random sequence), but not for true time series data: your writes will be faster, but your reads will be slower, because you will have to collect the information from many places.
Prefix a shard identifier to your key: You can distribute the load between multiple servers yourself. When you read the data back, you will have to prefix each possible shard number to your time, read from every server, and combine the query results in memory. This is a bit of a bother, but it will work; a minimal sketch follows this list.
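As a preview of the code we will write later in this chapter, here is a minimal sketch of such key salting using the standard HBase Java client. The table name (welllogs), the column family (d), and the choice of 16 salt buckets are illustrative assumptions, not details from the original design:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedTimeSeries {

        // Number of salt buckets; an assumption made for this sketch.
        private static final int NUM_BUCKETS = 16;

        // Build a salted row key: "<bucket>-<zero-padded timestamp>".
        // Zero-padding keeps lexicographic order equal to numeric order.
        static byte[] saltedKey(long timestampMillis) {
            int bucket = (int) (Math.abs(timestampMillis) % NUM_BUCKETS);
            return Bytes.toBytes(String.format("%02d-%019d", bucket, timestampMillis));
        }

        public static void main(String[] args) throws IOException {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("welllogs"))) { // hypothetical table

                // Write: the salt spreads consecutive timestamps over NUM_BUCKETS key ranges.
                long now = System.currentTimeMillis();
                Put put = new Put(saltedKey(now));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("pressure"), Bytes.toBytes("2875.4"));
                table.put(put);

                // Read a time range: query every bucket and merge the results in memory.
                long start = now - 3600_000L, stop = now;
                List<Result> merged = new ArrayList<>();
                for (int bucket = 0; bucket < NUM_BUCKETS; bucket++) {
                    Scan scan = new Scan()
                            .withStartRow(Bytes.toBytes(String.format("%02d-%019d", bucket, start)))
                            .withStopRow(Bytes.toBytes(String.format("%02d-%019d", bucket, stop)));
                    try (ResultScanner scanner = table.getScanner(scan)) {
                        for (Result r : scanner) {
                            merged.add(r); // re-sort by timestamp here if ordering matters
                        }
                    }
                }
                System.out.println("Rows fetched across all buckets: " + merged.size());
            }
        }
    }

Because consecutive timestamps rotate through the buckets, a burst of writes is spread over 16 key ranges instead of hammering a single region; the price, as noted above, is that every time-range read must fan out to all 16 buckets.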
Each of the three preceding pieces of advice is good in its own situation; it is just that for time series data, only the last one is practical. We will see how to write the full code for this later in this chapter.
A partial solution to the problem is preloading, which we will discuss later. In brief, preloading, or more technically bulk loading, comes into play when you already have a lot of data to write to HBase. In this case, you can choose the number of regions to create up front, so that all of them are used from the start. If you combine this with the real read/write load, you might be lucky and find your workload already distributed between the regions (shards) that you created in advance.
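For reference, here is a minimal sketch of pre-splitting a table so that a bulk load lands in several regions from the start. The table name, column family, and split points are assumptions chosen to line up with the 16 salt buckets sketched above:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PresplitTable {
        public static void main(String[] args) throws IOException {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {

                // One split point per salt bucket boundary: "01-", "02-", ..., "15-".
                byte[][] splits = new byte[15][];
                for (int i = 1; i <= 15; i++) {
                    splits[i - 1] = Bytes.toBytes(String.format("%02d-", i));
                }

                // Create the (hypothetical) table with 16 regions, one per bucket.
                admin.createTable(
                    TableDescriptorBuilder.newBuilder(TableName.valueOf("welllogs"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                        .build(),
                    splits);
            }
        }
    }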
 