Consequently, using a nonclustered index over R.a, the resulting plan first retrieves the RIDs of all tuples in R that satisfy R.a < 20. Then, using lookups against table R, it fetches the actual tuples that correspond to those RIDs. Finally, it performs a nested-loop join between the subset of tuples of R computed before and table S, which is sequentially scanned. For the case @C = 2000 in Figure 2.4c, the optimizer estimates that the number of tuples of R satisfying R.a < 2000 is rather large. The resulting plan therefore scans both tables sequentially (discarding on the fly the tuples from R that do not satisfy the condition R.a < 2000) and then performs a hash join to obtain the result (in this scenario, the lookups of the previous plan would have been too numerous and therefore too expensive). Finally, Figure 2.4b shows yet another execution plan that is chosen when the number of tuples of R satisfying the predicate is neither too small nor too large. In this case, table S is scanned in increasing order of S.y using a clustered index, and table R is first scanned sequentially (discarding invalid tuples on the fly as before) and then sorted by R.x. Finally, a merge join is performed on the two intermediate results.
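To make this intuition concrete, consider the following toy comparison of the three plan shapes as the estimated number of qualifying tuples of R grows. It is only a rough sketch: the constants and cost formulas below are illustrative assumptions, not the cost model of any particular optimizer.

import math

def choose_plan(est_k, num_r, num_s):
    # est_k: estimated number of tuples of R satisfying R.a < @C.
    # Plan (a): nonclustered index seek on R.a, one RID lookup per match,
    # then a nested-loop join with a sequentially scanned S.
    cost_a = est_k * 4.0 + est_k * num_s * 0.001
    # Plan (b): scan and filter R, sort the matches by R.x, read S in S.y
    # order through its clustered index, then merge join.
    cost_b = num_r + est_k * math.log2(max(est_k, 2)) + num_s
    # Plan (c): scan and filter R, scan S, then hash join.
    cost_c = num_r + num_s + 0.5 * (est_k + num_s)
    return min([(cost_a, "index seek + RID lookups + nested-loop join"),
                (cost_b, "sort + merge join"),
                (cost_c, "sequential scans + hash join")])[1]

# As the cardinality estimate grows, the cheapest plan shifts from (a) to (b) to (c).
for k in (10, 5_000, 500_000):
    print(k, "->", choose_plan(k, num_r=1_000_000, num_s=1_000_000))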
The previous examples illustrate that the cardinality of intermediate results
plays a fundamental role in the choice of execution plans. We next discuss how
cardinality estimation is done in a query optimizer.
2.2.1 Statistical Summaries of Data
For every table, statistical information usually includes the number of tuples,
the average size of each tuple, and other values, such as the number of physical
pages used by the table or the number of distinct tuples in the table. Statis-
tical information on table columns, if available, is helpful for estimating the
cardinality of range or join predicates. A large body of work in the literature
studies the representation of statistics on a given column or combination of
columns. In most systems, information on the data distribution on a column
is provided by histograms, since they can be built and maintained efficiently
by using sampling techniques.
Histograms divide the values of a column (or set of columns) into a number
of buckets and associate with each bucket some aggregated information. The
number of buckets influences the accuracy of the histogram but also affects
memory usage, since relevant histograms are loaded into memory during opti-
mization. Both the particular procedure used to select bucket boundaries and
the aggregated information associated with each bucket lead to different fam-
ilies of histograms. An example is the family of equi-depth histograms, which
divide the domain of a column in a table with n tuples among k buckets so that
each range has approximately the same number of values, n/k. Other histogram
variants are equi-width, max-diff, and end-biased histograms, which are illus-
trated in Figure 2.5. Although single-dimensional histograms are widely used,
multidimensional histograms are relatively rare.
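As a rough illustration, the sketch below builds an equi-depth histogram over a numeric column and uses it to estimate the cardinality of a range predicate such as R.a < c. It is a simplified sketch rather than the structure used by any particular system: buckets store only an upper boundary and a count, and estimation assumes values are uniformly distributed within each bucket.

def build_equi_depth(values, k):
    # Each of the k buckets covers roughly n/k of the sorted values.
    values = sorted(values)
    n = len(values)
    boundaries, counts = [], []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        if hi > lo:
            boundaries.append(values[hi - 1])  # upper boundary of bucket i
            counts.append(hi - lo)             # ~n/k values per bucket
    return boundaries, counts

def estimate_less_than(boundaries, counts, c, min_value):
    # Estimate |{v : v < c}| assuming uniformity inside each bucket.
    total, lower = 0.0, min_value
    for upper, cnt in zip(boundaries, counts):
        if c >= upper:
            total += cnt                                 # bucket entirely below c
        else:
            width = max(upper - lower, 1e-9)
            total += cnt * max(c - lower, 0.0) / width   # partial bucket
            break
        lower = upper
    return total

# Example: compare the estimate against the true cardinality on random data.
import random
data = [random.gauss(1000, 300) for _ in range(100_000)]
bnd, cnt = build_equi_depth(data, k=20)
print(estimate_less_than(bnd, cnt, 800, min(data)),
      sum(1 for v in data if v < 800))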
In addition to histograms, database systems maintain other statistics, such as the number of distinct values in a column (and sometimes certain combinations of columns).
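For instance, distinct-value counts support the classic textbook estimate for the cardinality of an equi-join, stated here in its generic form under uniformity and containment assumptions rather than as the estimator of any specific system:

def estimate_equijoin(num_r, num_s, distinct_r_x, distinct_s_y):
    # Estimate |R JOIN S ON R.x = S.y| as |R| * |S| / max(V(R.x), V(S.y)),
    # assuming values are uniformly distributed and the smaller set of
    # join values is contained in the larger one.
    return num_r * num_s / max(distinct_r_x, distinct_s_y)

# Example: two 1,000,000-tuple tables whose join columns have 50,000 and
# 40,000 distinct values yield an estimate of 20,000,000 result tuples.
print(estimate_equijoin(1_000_000, 1_000_000, 50_000, 40_000))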