Database Reference
In-Depth Information
As the name implies, an incremental load is one that processes a partial set of data
based only on what is new or changed since the last execution. Although many con-
sider incremental loads to be purely time based (for example, grabbing just the data
processed on the prior business day), it's not always that simple. Sometimes, changes
are made to historical data that should be reflected in downstream systems, and unfor-
tunately it's not always possible to detect when changes were made. (As an aside, those
ubiquitous “last update timestamp” fields are notorious for being wrong.)
Handling flexible data warehouse structures that allow not only inserts, but updates
as well, can be a challenging proposition. In this chapter, we'll surface a few design
patterns in SSIS that you can use when dealing with changing data in your data ware-
house.
Why Incremental Loads?
Imagine you are hired by a small shoe store to create a system through which the staff
can analyze their sales data. Let's say the store averages 50 transactions a day, amount-
ing to an average of 100 items (2 items per transaction). If you do simple math on the
row counts generated by the business, you'll end up with fewer than 200 rows of data
per day being sent to the analysis system. Even over the course of the entire year,
you're looking at less than 75,000 rows of data. With such a small volume of data, why
would you want to perform an incremental load? After all, it would be almost as effi-
cient to simply dump and reload the entire analytical database each night rather than try
to calculate which rows are new, changed, or deleted.
In a situation like the one just described, the best course of action might be to per-
form a full load each day. However, in the real world, few, if any, systems are so small
and simple. In fact, it's not at all uncommon for data warehouse developers to work in
environments that generate millions of rows of data per day. Even with the best hard-
ware money can buy, attempting to perform a daily full load on that volume of data
simply isn't practical. The incremental load seeks to solve this problem by allowing the
systematic identification of data to be moved from transactional systems to the data
warehouse. By selectively carving out only the data that requires manipula-
tion—specifically, the rows that have been inserted, updated, or deleted since the last
load—we can eliminate a lot of unnecessary and duplicate data processing.
The Slowly Changing Dimension
Search WWH ::




Custom Search