(The source of this image is http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/, and the image has been used with due permission.)
Ikai is actually talking about Google's BigTable, the design on which HBase is based. BigTable shards (divides the load) not only when a table outgrows the region size but also when it sees a disproportionately high load on one table. As far as I know, HBase does not have such a provision, which is all the more reason to be careful not to concentrate our keys. So, what should you do?
You can do one of the following things:
Avoid indices if you can: In your case, this means keying the row on something other than time. This is not a perfect approach for well logs, since you need to know when the values were recorded, and keying on the depth of the measurement would give you the same problem.
Randomize your writes: Even pseudorandomization will help you offload some of the servers. Again, in our case, the time ordering is essential information; we would be hard-pressed to do without it. These first two pieces of advice might be good for other situations (such as writing dictionary words in a random sequence), but not for true time series data: your writes will be faster, but your reads will be slower, because you will have to collect the information from many places.
Prefix a shard identifier to your key: You can distribute the load between multiple servers yourself. When you read the data back, you will have to prefix each possible shard number to your time, read from every server, and combine the query results in memory. This is a bit of a bother, but it will work; a minimal sketch follows this list.
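As a preview of the code we will write later in this chapter, here is a minimal sketch of such key salting using the standard HBase Java client. The table name (welllogs), the column family (d), and the choice of 16 salt buckets are illustrative assumptions, not details from the original design:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedTimeSeries {

        // Number of salt buckets; an assumption made for this sketch.
        private static final int NUM_BUCKETS = 16;

        // Build a salted row key: "<bucket>-<zero-padded timestamp>".
        // Zero-padding keeps lexicographic order equal to numeric order.
        static byte[] saltedKey(long timestampMillis) {
            int bucket = (int) (Math.abs(timestampMillis) % NUM_BUCKETS);
            return Bytes.toBytes(String.format("%02d-%019d", bucket, timestampMillis));
        }

        public static void main(String[] args) throws IOException {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("welllogs"))) { // hypothetical table

                // Write: the salt spreads consecutive timestamps over NUM_BUCKETS key ranges.
                long now = System.currentTimeMillis();
                Put put = new Put(saltedKey(now));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("pressure"), Bytes.toBytes("2875.4"));
                table.put(put);

                // Read a time range: query every bucket and merge the results in memory.
                long start = now - 3600_000L, stop = now;
                List<Result> merged = new ArrayList<>();
                for (int bucket = 0; bucket < NUM_BUCKETS; bucket++) {
                    Scan scan = new Scan()
                            .withStartRow(Bytes.toBytes(String.format("%02d-%019d", bucket, start)))
                            .withStopRow(Bytes.toBytes(String.format("%02d-%019d", bucket, stop)));
                    try (ResultScanner scanner = table.getScanner(scan)) {
                        for (Result r : scanner) {
                            merged.add(r); // re-sort by timestamp here if ordering matters
                        }
                    }
                }
                System.out.println("Rows fetched across all buckets: " + merged.size());
            }
        }
    }

Because consecutive timestamps rotate through the buckets, a burst of writes is spread over 16 key ranges instead of hammering a single region; the price, as noted above, is that every time-range read must fan out to all 16 buckets.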
Each of the three preceding pieces of advice is good in its own situation; it is just that for time series data, only the last one is practical. We will see how to write the full code for this later in this chapter.
A partial solution to the problem is preloading, which we will discuss later. In brief, preloading, or more technically bulk loading, comes into play when you already have a lot of data to write to HBase. In this case, you can choose the number of regions to create up front, so that all of them are used from the start. If you combine this with the real read/write load, you might be lucky and find your workload already distributed between the regions (shards) that you created in advance.
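For reference, here is a minimal sketch of pre-splitting a table so that a bulk load lands in several regions from the start. The table name, column family, and split points are assumptions chosen to line up with the 16 salt buckets sketched above:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PresplitTable {
        public static void main(String[] args) throws IOException {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {

                // One split point per salt bucket boundary: "01-", "02-", ..., "15-".
                byte[][] splits = new byte[15][];
                for (int i = 1; i <= 15; i++) {
                    splits[i - 1] = Bytes.toBytes(String.format("%02d-", i));
                }

                // Create the (hypothetical) table with 16 regions, one per bucket.
                admin.createTable(
                    TableDescriptorBuilder.newBuilder(TableName.valueOf("welllogs"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                        .build(),
                    splits);
            }
        }
    }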
 