Statistics, Data Distribution, and Cardinality - SQL Server Query Performance Tuning

Database Reference

In-Depth Information

Density

When creating an execution plan, the query optimizer analyzes the statistics of the columns used in the filter and JOIN

clauses. A filter criterion with high selectivity limits the number of rows from a table to a small result set and helps

the optimizer keep the query cost low. A column with a unique index will have a high selectivity since it can limit the

number of matching rows to one.

On the other hand, a filter criterion with low selectivity will return a large result set from the table. A filter

criterion with low selectivity makes a nonclustered index on the column ineffective. Navigating through a

nonclustered index to the base table for a large result set is usually costlier than scanning the base table (or clustered

index) directly because of the cost overhead of lookups associated with the nonclustered index. You can observe this

behavior in the execution plan in Figure 12-18 .

Statistics track the selectivity of a column in the form of a density ratio. A column with high selectivity

(or uniqueness) will have low density. A column with low density (that is, high selectivity) is suitable for a filtering

criteria because it can help the optimizer retrieve a small number of rows very fast. This is also the principle on which

filtered indexes operate since the filter's goal is to increase the selectivity, or density, of the index.

Density can be expressed as follows:

Density = 1 / Number of distinct values for a column

Density will always come out as a number somewhere between 0 and 1. The lower the column density, the more

suitable it is for use as an index key. You can perform your own calculations to determine the density of columns

within your own indexes and statistics. For example, to calculate the density of column c1 from the test table built by

create_t3.sql , use the following (results in Figure 12-19 ):

SELECT 1.0 / COUNT(DISTINCT C1)

FROM dbo.Test1;

Figure 12-19. Results of density calculation for column C1

You can see this as actual data in the All density column in the output from DBCC SHOW_ STATISTICS . This

high-density value for the column makes it a less suitable candidate for an index, even a filtered index. However, the

statistics of the index key values maintained in the steps help the query optimizer use the index for the predicate

C1 = 1 , as shown in the previous execution plan.

Statistics on a Multicolumn Index

In the case of an index with one column, statistics consist of a histogram and a density value for that column. Statistics

for a composite index with multiple columns consist of one histogram for the first column only and multiple density

values. This is one reason why it's generally a good practice to put the more selective column, the one with the lowest

density, first when building a compound index or compound statistics. The density values include the density for the

first column and for each prefix combination of the index key columns. Multiple density values help the optimizer

find the selectivity of the composite index when multiple columns are referred to by predicates in the WHERE and JOIN

clauses. Although the first column can help determine the histogram, the final density of the column itself would be

the same regardless of column order.

Search WWH ::

Custom Search

Home