Database Reference
In-Depth Information
Now consider the following query, which is very similar to Q2 but has a
slightly more complex predicate in the WHERE clause:
Q3 = SELECT B, C, D
FROM R
WHERE 50<B<100
AND61<2*D+1<81
On the surface, Q2 and Q3 are not very different. We can rewrite the pred-
icate on D to 30<D<40 and then, applying the same argument as in the
previous example, choose between I 3 =
) based
on statistical information on columns B and D . However, we are making the
additional assumption that the optimizer would recognize the algebraic ma-
nipulation that equates 61<2*D+1<81 and 30<D<40 . If the op-
timizer is unable to derive and leverage this identity, it would not be able to
use index I 4 to seek appropriate tuples (even when we know it is the best al-
ternative). In this scenario, if only I 4 is present, the optimizer would perform
a clustered index scan to answer the query, which in general would be worse
than the index-based alternative leveraging index I 3 .
To make the running example a bit more complex, suppose that our work-
load consists of both Q2 and Q3 . Then, the best configuration would be the
union of the best index for each query (since the workload does not have
updates, there is no performance penalty in adding indexes). Suppose that,
based on statistical information and knowledge about the optimizer's sim-
plification logic, we determine that the best configuration is I 1 =
R
(
B
|
C
,
D
) and I 4 =
R
(
D
|
C
,
B
R
(
A
|
B
,
C
)
=
(
|
,
)
and I 4
. Assume now that the storage required by both I 1 and
I 4 is unacceptably high (e.g., it might not fit in the available storage). In
this case, the situation is considerably more dicult. We can always decide
to drop either I 1 or I 4 (specifically, the one that results in the smaller im-
provement) and obtain a configuration that, at least, benefits one of the two
queries. However, consider an index like I 5 =
R
D
B
C
) , which is smaller
in size than I 1 and I 4 combined and thus might fit in the available storage.
This index is slightly worse than I 1 =
R
(
A
|
B
,
C
,
D
) for Q2 because of the addi-
tional included column D , which makes it wider and therefore more expensive
to scan. While we cannot seek I 5 for Q3 because its key column A does not
occur in the WHERE clause of Q3 , I 3 can still be scanned much more eciently
than the clustered index (it contains fewer columns). The net effect is that,
although I 5 is suboptimal for both Q2 and Q3 , it might still be better than
having a configuration with a single index that is optimal for either Q2 or
Q3 . Of course, I 6
R
(
A
|
B
,
C
should also be considered, since it helps
Q3 almost as much as the optimal I 4 and is better for Q2 than the plan us-
ing a clustered index. To make matters worse, other interesting alternatives
are possible as well. Consider the scenario where the available storage is even
more constrained and cannot accommodate index I 5 . In this situation, indexes
R
=
R
(
D
|
A
,
B
,
C
)
, which combined are smaller than I 5 , can be useful if I 5 itself
is too big. Depending on the number of tuples in the result of Q2 and Q3 , using
(
A
)
and R
(
D
)
Search WWH ::




Custom Search