Statistics, Data Distribution, and Cardinality - SQL Server Query Performance Tuning

Database Reference

In-Depth Information

Chapter 12

Statistics, Data Distribution,

and Cardinality

By now, you should have a good understanding of the importance of indexes. But, the index alone is not what the

optimizer uses to determine how it's going to access data. The optimizer must have information about the data that

defines an index or a column. That information is referred to as a statistic . Statistics define both the distribution of

data and the uniqueness or selectivity of the data. Statistics are maintained both on indexes and on columns within

the system. You can even define statistics manually yourself.

In this chapter, you'll learn the importance of statistics in query optimization. Specifically, I will cover the

following topics:

•

The role of statistics in query optimization

•

The importance of statistics on columns with indexes

•

The importance of statistics on nonindexed columns used in join and filter criteria

•

Analysis of single-column and multicolumn statistics, including the computation of selectivity

of a column for indexing

•

Statistics maintenance

•

Effective evaluation of statistics used in a query execution

The Role of Statistics in Query Optimization

SQL Server's query optimizer is a cost-based optimizer; it decides on the best data access mechanism and join

strategy by identifying the selectivity, how unique the data is, and which columns are used in filtering the data

(meaning via the WHERE or JOIN clause). Statistics exist with an index, but they also exist on columns without an index

that are used as part of a predicate. As you learned in Chapter 7, a nonclustered index is a great way to retrieve data

that is covered by the index, whereas with queries that need columns outside the key, a clustered index can work

better. With a large result set, going to the clustered index or table directly is usually more beneficial.

Up-to-date information on data distribution in the columns referenced as predicates helps the optimizer

determine the query strategy to use. In SQL Server, this information is maintained in the form of statistics, which are

essential for the cost-based optimizer to create an effective query execution plan. Through the statistics, the optimizer

can make reasonably accurate estimates about how long it will take to return a result set or an intermediate result set

and therefore determine the most effective operations to use to efficiently retrieve or modify the data as defined by

the T-SQL statement. As long as you ensure that the default statistical settings for the database are set, the optimizer

will be able to do its best to determine effective processing strategies dynamically. Also, as a safety measure while

troubleshooting performance, you should ensure that the automatic statistics maintenance routine is doing its job

Search WWH ::

Custom Search

Home