Consequently, using a nonclustered index over R.a, the resulting plan first retrieves the RIDs of all tuples in R that satisfy R.a < 20. Then, using lookups against table R, it fetches the actual tuples that correspond to those RIDs. Finally, it performs a nested-loop join between the subset of tuples of R computed before and table S, which is sequentially scanned. For the case @C = 2000 in Figure 2.4c, the optimizer estimates that the number of tuples of R satisfying R.a < 2000 is rather large. The resulting plan therefore scans both tables sequentially (discarding on the fly the tuples from R that do not satisfy the condition R.a < 2000) and then performs a hash join to obtain the result (in this scenario, the lookups of the previous plan would have been too numerous and therefore too expensive). Finally, Figure 2.4b shows yet another execution plan that is chosen when the number of tuples of R satisfying the predicate is neither too small nor too large. In this case, table S is scanned in increasing order of S.y using a clustered index, and table R is first scanned sequentially (discarding invalid tuples on the fly as before) and then sorted by R.x. Finally, a merge join is performed on the two intermediate results.
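To make this intuition concrete, consider the following toy comparison of the three plan shapes as the estimated number of qualifying tuples of R grows. It is only a rough sketch: the constants and cost formulas below are illustrative assumptions, not the cost model of any particular optimizer.

import math

def choose_plan(est_k, num_r, num_s):
    # est_k: estimated number of tuples of R satisfying R.a < @C.
    # Plan (a): nonclustered index seek on R.a, one RID lookup per match,
    # then a nested-loop join with a sequentially scanned S.
    cost_a = est_k * 4.0 + est_k * num_s * 0.001
    # Plan (b): scan and filter R, sort the matches by R.x, read S in S.y
    # order through its clustered index, then merge join.
    cost_b = num_r + est_k * math.log2(max(est_k, 2)) + num_s
    # Plan (c): scan and filter R, scan S, then hash join.
    cost_c = num_r + num_s + 0.5 * (est_k + num_s)
    return min([(cost_a, "index seek + RID lookups + nested-loop join"),
                (cost_b, "sort + merge join"),
                (cost_c, "sequential scans + hash join")])[1]

# As the cardinality estimate grows, the cheapest plan shifts from (a) to (b) to (c).
for k in (10, 5_000, 500_000):
    print(k, "->", choose_plan(k, num_r=1_000_000, num_s=1_000_000))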
The previous examples illustrate that the cardinality of intermediate results
plays a fundamental role in the choice of execution plans. We next discuss how
cardinality estimation is done in a query optimizer.
2.2.1 Statistical Summaries of Data
For every table, statistical information usually includes the number of tuples,
the average size of each tuple, and other values, such as the number of physical
pages used by the table or the number of distinct tuples in the table. Statis-
tical information on table columns, if available, is helpful for estimating the
cardinality of range or join predicates. A large body of work in the literature
studies the representation of statistics on a given column or combination of
columns. In most systems, information on the data distribution on a column
is provided by histograms, since they can be built and maintained efficiently
by using sampling techniques.
Histograms divide the values of a column (or set of columns) into a number
of buckets and associate with each bucket some aggregated information. The
number of buckets influences the accuracy of the histogram but also affects
memory usage, since relevant histograms are loaded into memory during opti-
mization. Both the particular procedure used to select bucket boundaries and
the aggregated information associated with each bucket lead to different fam-
ilies of histograms. An example is the family of equi-depth histograms, which
divide the domain of a column in a table with n tuples among k buckets so that
each range has approximately the same number of values, n/k. Other histogram
variants are equi-width, max-diff, and end-biased histograms, which are illus-
trated in Figure 2.5. Although single-dimensional histograms are widely used,
multidimensional histograms are relatively rare.
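As a rough illustration, the sketch below builds an equi-depth histogram over a numeric column and uses it to estimate the cardinality of a range predicate such as R.a < c. It is a simplified sketch rather than the structure used by any particular system: buckets store only an upper boundary and a count, and estimation assumes values are uniformly distributed within each bucket.

def build_equi_depth(values, k):
    # Each of the k buckets covers roughly n/k of the sorted values.
    values = sorted(values)
    n = len(values)
    boundaries, counts = [], []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        if hi > lo:
            boundaries.append(values[hi - 1])  # upper boundary of bucket i
            counts.append(hi - lo)             # ~n/k values per bucket
    return boundaries, counts

def estimate_less_than(boundaries, counts, c, min_value):
    # Estimate |{v : v < c}| assuming uniformity inside each bucket.
    total, lower = 0.0, min_value
    for upper, cnt in zip(boundaries, counts):
        if c >= upper:
            total += cnt                                 # bucket entirely below c
        else:
            width = max(upper - lower, 1e-9)
            total += cnt * max(c - lower, 0.0) / width   # partial bucket
            break
        lower = upper
    return total

# Example: compare the estimate against the true cardinality on random data.
import random
data = [random.gauss(1000, 300) for _ in range(100_000)]
bnd, cnt = build_equi_depth(data, k=20)
print(estimate_less_than(bnd, cnt, 800, min(data)),
      sum(1 for v in data if v < 800))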
In addition to histograms, database systems maintain other statistics, such as the number of distinct values in a column (and sometimes certain combinations of columns).
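For instance, distinct-value counts support the classic textbook estimate for the cardinality of an equi-join, stated here in its generic form under uniformity and containment assumptions rather than as the estimator of any specific system:

def estimate_equijoin(num_r, num_s, distinct_r_x, distinct_s_y):
    # Estimate |R JOIN S ON R.x = S.y| as |R| * |S| / max(V(R.x), V(S.y)),
    # assuming values are uniformly distributed and the smaller set of
    # join values is contained in the larger one.
    return num_r * num_s / max(distinct_r_x, distinct_s_y)

# Example: two 1,000,000-tuple tables whose join columns have 50,000 and
# 40,000 distinct values yield an estimate of 20,000,000 result tuples.
print(estimate_equijoin(1_000_000, 1_000_000, 50_000, 40_000))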