THE DATA WAREHOUSE - Fundamentals of Database Management Systems

Databases Reference

In-Depth Information

account, the bank doesn't care what the balance was ten days ago or ten hours ago.

The bank wants to know what the current balance is. There is no need to associate a

date or time with the bank balance; in effect, the data's date and time is always now .

(To be sure, some transactional data must include timestamps. A health insurance

company may keep six months of claim data online and such data clearly requires

timestamps.) On the other hand, data warehouse data, with its historical nature,

always includes some kind of a timestamp. If we are storing sales data on a weekly

or monthly basis and we have accumulated ten years of such historic data, each

weekly or monthly sales figure obviously must be accompanied by a timestamp

indicating the week or month (and year!) that it represents.

The Data Must Be High Quality

Transactional data can actually be somewhat forgiving of at least certain kinds of

errors. In the bank record example, the account balance must be accurate but if

there is, say, a one-letter misspelling of the street name in the account holder's

street address, that probably will not make a difference. It will not affect the account

balance and the post office will probably still deliver the account statements to the

right house. But what if the customer's street address is actually spelled correctly

in other transactional files? Consider a section of a data warehouse in which the

subject is 'customer.' It is crucial to establish an accurate set of customers for the

data warehouse data to be of any use. But with the address misspelling in one

transactional file, when the data from that file is integrated with the data from the

other transactional files, there will be some difficulty in reconciling whether the two

different addresses are the same and both represent one customer, or whether they

actually represent two different customers. This must be investigated and a decision

made on whether the records in the different files represent one customer or two

different customers. It is in this sense that the data in the data warehouse must be of

higher quality than the data in the transactional files.

The Data May Be Aggregated

When the data is copied and integrated from the transactional files into the data

warehouse, it is often aggregated or summarized, for at least three reasons. One

is that the type of data that management requires for decision making is generally

summarized data. When trying to decide how much stock to order for a store for

next December based on the sales data from the last five Decembers, the monthly

sales figures are obviously useful but the individual daily sales figures during those

last five Decembers probably don't matter much. The second reason for having

aggregated data in the data warehouse is that the sheer volume of all of the historical

detail data would often make the data warehouse unacceptably huge (they tend to

be large as it is!). And the third reason is that if the detail data were stored in the

data warehouse, the amount of time needed to summarize the data for management

every time a query was posed would often be unacceptable. Having said all that, the

decision support environment is so broad that some situations within it do call for

detail data and, indeed, some data warehouses do contain at least some detail data.

The Data is Often Denormalized

One of the fundamental truths about database we have already encountered is that

data redundancy improves the performance of read-only queries but takes up more

disk space, requires more time to update, and introduces possible data integrity

problems when the data has to be updated. But in the case of the data warehouse,

Fundamentals of Database Management Systems

Search WWH ::

Custom Search

Home