Databases Reference
In-Depth Information
The selectivity of the union of two selection operations (predicates) on the same
table can be estimated using the well-known formula for randomly selected variables:
×
S(P or Q) = S(P) + S(Q) - S(P)
S(Q)
3.5
where P and Q are predicates.
So if we take the same query above and replace the intersection of predicates with a
union of predicates, we have:
SELECT city, qty
FROM shipment
WHERE city = 'London'
OR qty = 1000;
S(city = 'London') = .3
S(qty = 1000) = .6
S(city = 'London' or qty = 1000) = .3 + .6 - .3
×
.6
= .72.
3.5.2 Histograms
The use of average values to compute selectivities can be reasonably accurate for some
data, but for other data it may be off by significantly large amounts. If all databases only
used this approximation, estimates of query time could be seriously misleading. Fortu-
nately, many database management systems now store the actual distribution of
attribute values as a histogram . In a histogram, the values of an attribute are divided into
ranges, and within each range, a count of the number of rows whose attribute falls
within that range is made.
In the example above we were given the selectivity of qty = 1000 to be .6. If we
know that there are 2,000 different quantities in the shipment table out of 100,000
rows, then the average number of rows for a given quantity would be 100,000/2,000 =
50. Therefore, the selectivity of qty = 1000 would be 50/100,000 = .0005. If we have
stored a histogram of quantities in ranges consisting of integer values: 1, 2, 3, 4, ……,
1,000, 1,001,……2,000, and found that we had 60,000 rows containing quantity val-
ues equal to1,000, we would estimate the selectivity of qty = 1000 to be .6. This is a
huge difference in accuracy that would have dramatic effects on query execution plan
cost estimation and optimal plan selection.
Search WWH ::




Custom Search