Physical Database Design - Automated Physical Database Design and Tuning

Database Reference

In-Depth Information

such scenarios, the good properties of various indexes need to be quantitatively

balanced, as it becomes challenging to understand how indexes interact to

answer a given query. In general, some form of quantitative and systematic

analysis becomes crucial for an effective physical design, as illustrated next

by using examples.

3.1.2 Systematic Physical Database Design

Consider a query that returns tuples satisfying a range predicate from a single

table R :

Q1 = SELECT A, B, C

FROM R

WHERE 10<A<20

Suppose that we want to obtain the best configuration of indexes for such

a query. In this case, we can show that an index I 1 =

is optimal.

In fact, we can seek I 1 to obtain all tuples that satisfy the range predicate

on column A . Additionally, index I 1 includes columns B and C , which together

cover the query and avoid additional RID lookups. With the exception of the

initial traversal of index I 1 to obtain the first tuple that satisfies the predicate,

the plan that uses I 1 accesses exactly the tuples that constitute the answer

for the query and therefore cannot be improved.

Now consider a slight variation of Q1 , with an additional range predicate

on column B :

R

(

A

|

B

,

C

)

Q2 = SELECT A, B, C

FROM R

WHERE 10<A<20

AND20<B<100

A plan using the previous index I 1 would seek the index and obtain all tuples

that satisfy predicate 10<A<20 . From this intermediate result, all tuples

that do not satisfy predicate 20<B<100 would be eliminated. Note that

in this case we do not scan just the query result. If most of the tuples satisfy

the predicate on A but just a handful do so for the predicate on B , we would

read many unnecessary tuples from disk, thus wasting several inputs/outputs

(I/Os). Analogously, a strategy using index I 2 =

would first obtain

all tuples satisfying predicate 20<B<100 and then filter those not satisfy-

ing the predicate on A . It can be shown that for any data distribution, either

I 1 or I 2 results in the better plan, but it is not immediate which one is better

for any specific instance. If we have access to statistics on base tables (e.g.,

histograms as discussed in Chapter 2) and a rough understanding of the cost

model of the optimizer, we can calculate, for any data instance, which alter-

native between I 1 and I 2 is better. Specifically, we should choose the index

whose key column corresponds to the more selective predicate.

R

(

B

|

A

,

C

)

Automated Physical Database Design and Tuning

Search WWH ::

Custom Search

Home