Database Reference
In-Depth Information
data management programs from statistical package vendors such as SAS (SAS Analytics) or
IBM (IBM SPSS Statistics) in the data warehouse.
Metadata concerning the data's source, format, assumptions and constraints, and other
facts are kept in a data warehouse metadata database . The data warehouse DBMS extracts
and provides data to BI tools, such as data mining programs.
Once problematic operational data have been cleaned in the ETL system,
the corrected data can also be used to update the operational system to
fix the original data problems.
By The WAy
Problems with Operational Data
Most operational databases have problems that limit their usefulness to all but the simplest BI
applications. Figure 12-5 lists the major problem categories.
First, although data that are critical for successful operations must be complete and
accurate, data that are only marginally necessary need not be. For example, some operational
systems gather customer demographic data during the ordering process. However, because
such data are not needed to fill, ship, or bill orders, the quality of the demographic data suffers.
Problematic data are termed dirty data . Examples are a value of “G” for customer sex and
a value of “213” for customer age. Other examples are a value of “999-999-9999” for a U.S. phone
number, a part color of “gren,” and an e-mail address of “WhyMe@somewhereelseintheuni-
verse.who.” All of these values pose problems for reporting and data mining purposes.
Purchased data often contain missing elements. In fact, most data vendors state the per-
centage of missing values for each attribute in the data they sell. An organization buys such
data because, for some uses, some data are better than no data at all. This is especially true for
data items whose values are difficult to obtain, such as the number of adults in a household,
household income, dwelling type, and the education of primary income earner. Some missing
data are not too much of a problem for reporting applications. For data mining applications,
however, a few missing or erroneous data points can actually be worse than no data at all
because they bias the analysis.
Inconsistent data, the third problem in Figure 12-5, is particularly common for data that
have been gathered over time. When an area code changes, for example, the phone number
for a given customer before the change will differ from the customer's phone number after
the change. Part codes can change, as can sales territories. Before such data can be used, they
must be recoded for consistency over the period of the study.
Some data inconsistencies occur because of the nature of the business activity. Consider
a Web-based order entry system used by customers around the world. When the Web server
records the time of order, which time zone does it use? The server's system clock time is
irrelevant to an analysis of customer behavior. Any standard time such as Universal Time
Figure 12-5
Problems of Using
Transaction Data for
Business Intelligence
• Dirty data
• Missing values
• Inconsistent data
• Data not integrated
• Wrong format
- Too fine
- Not fine enough
• Too much data
- Too many attributes
- Too much volume
 
Search WWH ::




Custom Search