A typical query in Data Warehouse systems selects data from a facts table, joining it with one or more dimensions
tables. SQL Server detects star and snowflake database schemas and uses several optimization techniques to reduce
the number of rows to scan and the amount of I/O required for a query. It pushes predicates toward the lowest
operators in the execution plan tree, trying to evaluate them as early as possible and reduce the number of rows that
need to be selected. Other optimizations include cross joins of dimensions tables and pre-filtering of hash joins with
bitmap filters.
Note
Defining foreign key constraints between facts and dimensions tables helps SQL Server detect star and
snowflake schemas more reliably. You may consider creating foreign key constraints using the WITH NOCHECK option,
as shown in the sketch below, if the overhead of constraint validation at the creation stage is unacceptable.
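A minimal sketch of such a constraint, assuming the dbo.FactSales and dbo.DimArticles tables used later in
Listing 34-1 (the constraint name is illustrative), could look like the following:

-- Create the foreign key without validating existing rows (WITH NOCHECK).
-- Keep in mind that a constraint created this way is marked as not trusted
-- until it is revalidated.
alter table dbo.FactSales with nocheck
    add constraint FK_FactSales_DimArticles
    foreign key (ArticleId) references dbo.DimArticles (ArticleId);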
Even with all optimizations, however, query performance in large Data Warehouses is not always sufficient.
Scanning gigabytes or terabytes of data is time consuming even on today's hardware. Part of the problem is the nature
of query processing in SQL Server; that is, operators request and process rows one by one, which is not always efficient
when a large number of rows needs to be processed.
Some of these problems can be addressed with columnstore indexes and batch-mode data processing, which I
will cover next.
Columnstore Indexes and Batch-Mode Processing Overview
As already mentioned, the typical Data Warehouse query joins facts and dimensions tables and performs
calculations and aggregations while accessing just a subset of the facts table's columns. Listing 34-1 shows an example
of such a query in a database that follows the star schema design pattern shown in Figure 34-1.
Listing 34-1. Typical query in Data Warehouse environment
select a.ArticleCode, sum(s.Quantity) as [Units Sold]
from
    dbo.FactSales s join dbo.DimArticles a on
        s.ArticleId = a.ArticleId
    join dbo.DimDates d on
        s.DateId = d.DateId
where
    d.AnYear = 2014
group by
    a.ArticleCode
As you can see, this query needs to perform a scan of a large amount of data from the facts table; however, it uses
just three table columns. With regular row-based processing, SQL Server accesses rows one by one, loading the entire
row into memory, regardless of how many columns from the row are required.
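If you want to measure that cost in your own environment, a quick sketch, assuming the example schema from
Listing 34-1, is to enable I/O and time statistics before running the query:

set statistics io on;
set statistics time on;

-- The query from Listing 34-1; the Messages tab will show the logical reads
-- against each table and the CPU/elapsed time of the statement.
select a.ArticleCode, sum(s.Quantity) as [Units Sold]
from dbo.FactSales s
    join dbo.DimArticles a on s.ArticleId = a.ArticleId
    join dbo.DimDates d on s.DateId = d.DateId
where d.AnYear = 2014
group by a.ArticleCode;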
You can reduce the storage size of the table and, therefore, the number of I/O operations by implementing page
compression. However, page compression works in the scope of a single page: every page maintains its own copy of
the compression dictionary rather than sharing a single dictionary across the table.
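For reference, a minimal sketch of enabling page compression on the fact table from Listing 34-1 could look
like this:

-- Rebuild the heap or clustered index with page compression; existing
-- nonclustered indexes can be rebuilt the same way with ALTER INDEX.
alter table dbo.FactSales rebuild with (data_compression = page);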
Finally, there is another, less obvious problem. Even though access to in-memory data is orders of magnitude
faster than access to data on disk, it is still slow compared to CPU cache access time. With row-based processing,
SQL Server constantly reloads the CPU cache with new rows copied from main memory. This overhead is usually not
a problem with OLTP workloads and simple queries; however, it becomes very noticeable with Data Warehouse
queries that process millions or even billions of rows.
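To give a sense of what is coming, a minimal sketch of a columnstore index covering just the columns used by the
query in Listing 34-1 could look like the following (the index name is illustrative, and the syntax shown is the
nonclustered columnstore variant):

-- Store ArticleId, DateId, and Quantity in columnar format so that the query
-- in Listing 34-1 can read only the columns it actually needs.
create nonclustered columnstore index IDX_FactSales_CS
on dbo.FactSales (ArticleId, DateId, Quantity);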
 
 