Databases Reference
In-Depth Information
account, the bank doesn't care what the balance was ten days ago or ten hours ago.
The bank wants to know what the current balance is. There is no need to associate a
date or time with the bank balance; in effect, the data's date and time is always now .
(To be sure, some transactional data must include timestamps. A health insurance
company may keep six months of claim data online and such data clearly requires
timestamps.) On the other hand, data warehouse data, with its historical nature,
always includes some kind of a timestamp. If we are storing sales data on a weekly
or monthly basis and we have accumulated ten years of such historic data, each
weekly or monthly sales figure obviously must be accompanied by a timestamp
indicating the week or month (and year!) that it represents.
The Data Must Be High Quality
Transactional data can actually be somewhat forgiving of at least certain kinds of
errors. In the bank record example, the account balance must be accurate but if
there is, say, a one-letter misspelling of the street name in the account holder's
street address, that probably will not make a difference. It will not affect the account
balance and the post office will probably still deliver the account statements to the
right house. But what if the customer's street address is actually spelled correctly
in other transactional files? Consider a section of a data warehouse in which the
subject is 'customer.' It is crucial to establish an accurate set of customers for the
data warehouse data to be of any use. But with the address misspelling in one
transactional file, when the data from that file is integrated with the data from the
other transactional files, there will be some difficulty in reconciling whether the two
different addresses are the same and both represent one customer, or whether they
actually represent two different customers. This must be investigated and a decision
made on whether the records in the different files represent one customer or two
different customers. It is in this sense that the data in the data warehouse must be of
higher quality than the data in the transactional files.
The Data May Be Aggregated
When the data is copied and integrated from the transactional files into the data
warehouse, it is often aggregated or summarized, for at least three reasons. One
is that the type of data that management requires for decision making is generally
summarized data. When trying to decide how much stock to order for a store for
next December based on the sales data from the last five Decembers, the monthly
sales figures are obviously useful but the individual daily sales figures during those
last five Decembers probably don't matter much. The second reason for having
aggregated data in the data warehouse is that the sheer volume of all of the historical
detail data would often make the data warehouse unacceptably huge (they tend to
be large as it is!). And the third reason is that if the detail data were stored in the
data warehouse, the amount of time needed to summarize the data for management
every time a query was posed would often be unacceptable. Having said all that, the
decision support environment is so broad that some situations within it do call for
detail data and, indeed, some data warehouses do contain at least some detail data.
The Data is Often Denormalized
One of the fundamental truths about database we have already encountered is that
data redundancy improves the performance of read-only queries but takes up more
disk space, requires more time to update, and introduces possible data integrity
problems when the data has to be updated. But in the case of the data warehouse,
Search WWH ::




Custom Search