A typical query in Data Warehouse systems selects data from a facts table, joining it with one or more dimensions
tables. SQL Server detects star and snowflake database schemas and uses several optimization techniques to reduce
the number of rows to scan and the amount of I/O required for a query. It pushes predicates toward the lowest
operators in the execution plan tree, trying to evaluate them as early as possible and reduce the number of rows that
need to be selected. Other optimizations include cross joins of dimensions tables and pre-filtering of hash joins with
bitmap filters.
Note
Defining foreign key constraints between facts and dimensions tables helps SQL Server detect star and
snowflake schemas more reliably. You may consider creating foreign key constraints using the WITH NOCHECK option,
as shown in the sketch below, if the overhead of constraint validation at the creation stage is unacceptable.
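A minimal sketch of such a constraint, assuming the dbo.FactSales and dbo.DimArticles tables used later in
Listing 34-1 (the constraint name is illustrative), could look like the following:

-- Create the foreign key without validating existing rows (WITH NOCHECK).
-- Keep in mind that a constraint created this way is marked as not trusted
-- until it is revalidated.
alter table dbo.FactSales with nocheck
    add constraint FK_FactSales_DimArticles
    foreign key (ArticleId) references dbo.DimArticles (ArticleId);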
Even with all optimizations, however, query performance in large Data Warehouses is not always sufficient.
Scanning gigabytes or terabytes of data is time consuming even on today's hardware. Part of the problem is the nature
of query processing in SQL Server; that is, operators request and process rows one by one, which is not always efficient
when a large number of rows needs to be processed.
Some of these problems can be addressed with columnstore indexes and batch-mode data processing, which I
will cover next.
Columnstore Indexes and Batch-Mode Processing Overview
As already mentioned, the typical Data Warehouse query joins facts and dimensions tables and performs
calculations and aggregations while accessing just a subset of the facts table's columns. Listing 34-1 shows an example
of such a query in a database that follows the star schema design pattern shown in Figure 34-1.
Listing 34-1. Typical query in Data Warehouse environment
select a.ArticleCode, sum(s.Quantity) as [Units Sold]
from
    dbo.FactSales s join dbo.DimArticles a on
        s.ArticleId = a.ArticleId
    join dbo.DimDates d on
        s.DateId = d.DateId
where
    d.AnYear = 2014
group by
    a.ArticleCode
As you can see, this query needs to perform a scan of a large amount of data from the facts table; however, it uses
just three table columns. With regular row-based processing, SQL Server accesses rows one by one, loading the entire
row into memory, regardless of how many columns from the row are required.
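If you want to measure that cost in your own environment, a quick sketch, assuming the example schema from
Listing 34-1, is to enable I/O and time statistics before running the query:

set statistics io on;
set statistics time on;

-- The query from Listing 34-1; the Messages tab will show the logical reads
-- against each table and the CPU/elapsed time of the statement.
select a.ArticleCode, sum(s.Quantity) as [Units Sold]
from dbo.FactSales s
    join dbo.DimArticles a on s.ArticleId = a.ArticleId
    join dbo.DimDates d on s.DateId = d.DateId
where d.AnYear = 2014
group by a.ArticleCode;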
You can reduce the storage size of the table and, therefore, the number of I/O operations by implementing page
compression. However, page compression works in the scope of a single page: every page maintains its own copy of
the compression dictionary rather than sharing a single dictionary across the table.
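For reference, a minimal sketch of enabling page compression on the fact table from Listing 34-1 could look
like this:

-- Rebuild the heap or clustered index with page compression; existing
-- nonclustered indexes can be rebuilt the same way with ALTER INDEX.
alter table dbo.FactSales rebuild with (data_compression = page);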
Finally, there is another, less obvious problem. Even though access to in-memory data is orders of magnitude
faster than access to data on disk, it is still slow compared to CPU cache access time. With row-based processing,
SQL Server constantly reloads the CPU cache with new rows copied from main memory. This overhead is usually not
a problem with OLTP workloads and simple queries; however, it becomes very noticeable with Data Warehouse
queries that process millions or even billions of rows.
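To give a sense of what is coming, a minimal sketch of a columnstore index covering just the columns used by the
query in Listing 34-1 could look like the following (the index name is illustrative, and the syntax shown is the
nonclustered columnstore variant):

-- Store ArticleId, DateId, and Quantity in columnar format so that the query
-- in Listing 34-1 can read only the columns it actually needs.
create nonclustered columnstore index IDX_FactSales_CS
on dbo.FactSales (ArticleId, DateId, Quantity);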
 
 