Database Reference
In-Depth Information
an approximate histogram, or produce a set of fine-grain, equal-width bins and
coalesce the fine bins into the desired number of nearly equal-weight bins.
The multilevel bitmap indexes are composed of multiple bitmap indexes,
each one corresponding to a different granularity of binning. To make it easier
to reuse information from different levels of binning, we ensure that the bin
boundaries from coarser levels are a subset of those used for the next finer
level of bins. This generates a hierarchy of bins. To minimize the average
query processing cost, the multilevel bitmap indexes mentioned in the previous
subsection always use equal-weight bins for the coarse levels. These indexes all
use two levels of bins, with the fine level having one bitmap for each distinct
value. We consider such indexes to be precise indexes because they can answer
any queries with the bitmaps, without accessing the base data.
Even though binning can reduce the number of bitmaps and improve the
query response time in many cases, for some queries, however, we have to go
back to the base data to answer the queries accurately. For example, if we have
100 equal-width bins for pressure between 0 and 10 9 , then the query condition
“pressure
10 7 ” can be resolved with the index only. We know bins 0
and 1 contain records that satisfy the query condition, and bins 3 and onward
contain records that do not satisfy the condition, but we are not sure which
records in bin 2 satisfy the condition. We need to examine the actual values of
all records in bin 2 to decide. In this case, we say that bin 2 is the boundary bin
of the query and call the records in bin 2 candidates of the query. The process
of examining the raw data to resolve the query accurately is called candi-
date checking . When a candidate check is needed, it often dominates the total
query response time. There are a number of different approaches to minimize
the impact of candidate checks. One approach is to reorder the expression
being evaluated to minimize the overall cost of candidate checks. 88 Another
approach is to place the bin boundaries to minimize the cost of evaluating a
fixed set of queries. 48 , 75 - 77
Both approaches mentioned above do not actually reduce the cost of a
candidate-checking operation. More recently, a new approach was proposed
to do just that. 111 It does so by providing a clustered copy named order-
preserving bin-based clustering (OrBiC) of the base data. Since the values of
all records in a bin are organized contiguously, the time needed for a candidate-
checking operation is minimized. In tests, this approach was shown to signifi-
cantly outperform the unbinned indexes. Figure 6.18 shows some performance
numbers to illustrate the key advantages of the new approach. In Figure 6.18a,
we see that the binned index with OrBiC outperforms the one without OrBiC
for all query conditions tested. In Figure 6.18b, we see how the advantage
of OrBiC varies with the number of bins. Clearly, we see that the advantage
of OrBiC is significantly affected by the number of bins used. The analysis
provided by the authors can predict the optimal number of bins for simple
types of data, 111 but additional work is needed to determine the number of
bins for more realistic data.
>
2
.
5
×
Search WWH ::




Custom Search