Query Optimization and Plan Selection - Physical Database Design

Databases Reference

In-Depth Information

The selectivity of the union of two selection operations (predicates) on the same

table can be estimated using the well-known formula for randomly selected variables:

×

S(P or Q) = S(P) + S(Q) - S(P)

S(Q)

3.5

where P and Q are predicates.

So if we take the same query above and replace the intersection of predicates with a

union of predicates, we have:

SELECT city, qty

FROM shipment

WHERE city = 'London'

OR qty = 1000;

S(city = 'London') = .3

S(qty = 1000) = .6

S(city = 'London' or qty = 1000) = .3 + .6 - .3

×

.6

= .72.

3.5.2 Histograms

The use of average values to compute selectivities can be reasonably accurate for some

data, but for other data it may be off by significantly large amounts. If all databases only

used this approximation, estimates of query time could be seriously misleading. Fortu-

nately, many database management systems now store the actual distribution of

attribute values as a histogram . In a histogram, the values of an attribute are divided into

ranges, and within each range, a count of the number of rows whose attribute falls

within that range is made.

In the example above we were given the selectivity of qty = 1000 to be .6. If we

know that there are 2,000 different quantities in the shipment table out of 100,000

rows, then the average number of rows for a given quantity would be 100,000/2,000 =

50. Therefore, the selectivity of qty = 1000 would be 50/100,000 = .0005. If we have

stored a histogram of quantities in ranges consisting of integer values: 1, 2, 3, 4, ……,

1,000, 1,001,……2,000, and found that we had 60,000 rows containing quantity val-

ues equal to1,000, we would estimate the selectivity of qty = 1000 to be .6. This is a

huge difference in accuracy that would have dramatic effects on query execution plan

cost estimation and optimal plan selection.

Search WWH ::

Custom Search

Home