Database Reference
In-Depth Information
shows some timing measurements to support the analysis. In this case, we
see that two-level encodings (equality-equality encoding EE, range-equality
encoding RE, and interval-equality encoding IE) can be as much as ten times
faster than the basic bitmap index (marked E1). On the average, the two-
level encoded indexes are about three to five times faster than both the basic
bitmap index and the binary encoded index (BN). 110
6.6.4 Binning
Scientific data often contains floating-point values with extremely high col-
umn cardinality. For example, the temperature and pressure in a combustion
simulation can take on a large range of possible values, and each value rarely
repeats in a dataset. The basic bitmap index will generate many millions of
bitmaps in a typical dataset. Such indexes are typically large and slow to work,
even with the best compression and bitmap encoding. We observed that such
precise indexing is often unnecessary since the applications do not usually
demand full precision. For example, a typical query involving pressure is of
the form “pressure
10 7 Pascal.” In this case, the constant in the query
expression has only one significant digit. Often, such constants have no more
than a few significant digits. One may take advantage of this observation and
significantly reduce the number of bitmaps used in a bitmap index.
In general, the technique of grouping many values together is called
binning . 48 , 83 , 88 , 105 The values placed in a bin are not necessarily consecutive
values. 48 However, the most common forms of binning always place values
from a consecutive range into a bin. For example, if the valid pressure values
are in the range between 0 and 10 9 , we may place values between 0 and 1 in
the first bin, values between 1 and 10 in the second bin, values between 10 and
100 in the third bin, and so on. This particular form of binning is commonly
known as logarithmic binning. To produce a binned index that will answer
all range conditions using one-digit query boundaries, we can place all values
that round to the same one-digit number into a bin. *
A simple way to divide all pressure values between 0 and 10 9 into 100 bins
would be to place all values between i
>
2
×
10 7 in bin i .We
call them equal-width bins . Since each equal-width bin may contain a different
number of records, the corresponding bitmaps will have varying sizes, and the
amount of work associated with them will be different too. One way to reduce
this variance is to make sure that each bin has the same number of records. We
call such bins equal-weight bins . To produce such equal-weight bins, we need to
first find the number of occurrences for each value. Computing such detailed
histograms may take a long time and a large amount of memory, because there
may be a large number of distinct values. We can sample the data to produce
10 7
×
and
(
i
+
1
) ×
* A caveat: We actually split all values that round to a low-precision number x into two bins: one
for those that round up to x and one for those that round down to x .
Search WWH ::




Custom Search