Accelerating Queries on Very Large Datasets - Scientific Data Management

Database Reference

In-Depth Information

an approximate histogram, or produce a set of fine-grain, equal-width bins and

coalesce the fine bins into the desired number of nearly equal-weight bins.

The multilevel bitmap indexes are composed of multiple bitmap indexes,

each one corresponding to a different granularity of binning. To make it easier

to reuse information from different levels of binning, we ensure that the bin

boundaries from coarser levels are a subset of those used for the next finer

level of bins. This generates a hierarchy of bins. To minimize the average

query processing cost, the multilevel bitmap indexes mentioned in the previous

subsection always use equal-weight bins for the coarse levels. These indexes all

use two levels of bins, with the fine level having one bitmap for each distinct

value. We consider such indexes to be precise indexes because they can answer

any queries with the bitmaps, without accessing the base data.

Even though binning can reduce the number of bitmaps and improve the

query response time in many cases, for some queries, however, we have to go

back to the base data to answer the queries accurately. For example, if we have

100 equal-width bins for pressure between 0 and 10 9 , then the query condition

“pressure

10 7 ” can be resolved with the index only. We know bins 0

and 1 contain records that satisfy the query condition, and bins 3 and onward

contain records that do not satisfy the condition, but we are not sure which

records in bin 2 satisfy the condition. We need to examine the actual values of

all records in bin 2 to decide. In this case, we say that bin 2 is the boundary bin

of the query and call the records in bin 2 candidates of the query. The process

of examining the raw data to resolve the query accurately is called candi-

date checking . When a candidate check is needed, it often dominates the total

query response time. There are a number of different approaches to minimize

the impact of candidate checks. One approach is to reorder the expression

being evaluated to minimize the overall cost of candidate checks. 88 Another

approach is to place the bin boundaries to minimize the cost of evaluating a

fixed set of queries. 48 , 75 - 77

Both approaches mentioned above do not actually reduce the cost of a

candidate-checking operation. More recently, a new approach was proposed

to do just that. 111 It does so by providing a clustered copy named order-

preserving bin-based clustering (OrBiC) of the base data. Since the values of

all records in a bin are organized contiguously, the time needed for a candidate-

checking operation is minimized. In tests, this approach was shown to signifi-

cantly outperform the unbinned indexes. Figure 6.18 shows some performance

numbers to illustrate the key advantages of the new approach. In Figure 6.18a,

we see that the binned index with OrBiC outperforms the one without OrBiC

for all query conditions tested. In Figure 6.18b, we see how the advantage

of OrBiC varies with the number of bins. Clearly, we see that the advantage

of OrBiC is significantly affected by the number of bins used. The analysis

provided by the authors can predict the optimal number of bins for simple

types of data, 111 but additional work is needed to determine the number of

bins for more realistic data.

>

2

.

5

×

Scientific Data Management

Search WWH ::

Custom Search

Home