Database Reference
In-Depth Information
Chapter 12
Statistics, Data Distribution,
and Cardinality
By now, you should have a good understanding of the importance of indexes. But, the index alone is not what the
optimizer uses to determine how it's going to access data. The optimizer must have information about the data that
defines an index or a column. That information is referred to as a statistic . Statistics define both the distribution of
data and the uniqueness or selectivity of the data. Statistics are maintained both on indexes and on columns within
the system. You can even define statistics manually yourself.
In this chapter, you'll learn the importance of statistics in query optimization. Specifically, I will cover the
following topics:
The role of statistics in query optimization
The importance of statistics on columns with indexes
The importance of statistics on nonindexed columns used in join and filter criteria
Analysis of single-column and multicolumn statistics, including the computation of selectivity
of a column for indexing
Statistics maintenance
Effective evaluation of statistics used in a query execution
The Role of Statistics in Query Optimization
SQL Server's query optimizer is a cost-based optimizer; it decides on the best data access mechanism and join
strategy by identifying the selectivity, how unique the data is, and which columns are used in filtering the data
(meaning via the WHERE or JOIN clause). Statistics exist with an index, but they also exist on columns without an index
that are used as part of a predicate. As you learned in Chapter 7, a nonclustered index is a great way to retrieve data
that is covered by the index, whereas with queries that need columns outside the key, a clustered index can work
better. With a large result set, going to the clustered index or table directly is usually more beneficial.
Up-to-date information on data distribution in the columns referenced as predicates helps the optimizer
determine the query strategy to use. In SQL Server, this information is maintained in the form of statistics, which are
essential for the cost-based optimizer to create an effective query execution plan. Through the statistics, the optimizer
can make reasonably accurate estimates about how long it will take to return a result set or an intermediate result set
and therefore determine the most effective operations to use to efficiently retrieve or modify the data as defined by
the T-SQL statement. As long as you ensure that the default statistical settings for the database are set, the optimizer
will be able to do its best to determine effective processing strategies dynamically. Also, as a safety measure while
troubleshooting performance, you should ensure that the automatic statistics maintenance routine is doing its job
 
Search WWH ::




Custom Search