Databases Reference
In-Depth Information
have been deleted. Furthermore, the recording of the data history or modifications may
have been overlooked. Missing data, particularly for tuples with missing values for some
attributes, may need to be inferred.
Recall that data quality depends on the intended use of the data. Two different users
may have very different assessments of the quality of a given database. For example, a
marketing analyst may need to access the database mentioned before for a list of cus-
tomer addresses. Some of the addresses are outdated or incorrect, yet overall, 80% of
the addresses are accurate. The marketing analyst considers this to be a large customer
database for target marketing purposes and is pleased with the database's accuracy,
although, as sales manager, you found the data inaccurate.
Timeliness also affects data quality. Suppose that you are overseeing the distribu-
tion of monthly sales bonuses to the top sales representatives at AllElectronics . Several
sales representatives, however, fail to submit their sales records on time at the end of
the month. There are also a number of corrections and adjustments that flow in after
the month's end. For a period of time following each month, the data stored in the
database are incomplete. However, once all of the data are received, it is correct. The fact
that the month-end data are not updated in a timely fashion has a negative impact on
the data quality.
Two other factors affecting data quality are believability and interpretability. Believ-
ability reflects how much the data are trusted by users, while interpretability reflects
how easy the data are understood. Suppose that a database, at one point, had several
errors, all of which have since been corrected. The past errors, however, had caused
many problems for sales department users, and so they no longer trust the data. The
data also use many accounting codes, which the sales department does not know how to
interpret. Even though the database is now accurate, complete, consistent, and timely,
sales department users may regard it as of low quality due to poor believability and
interpretability.
3.1.2 Major Tasks in Data Preprocessing
In this section, we look at the major steps involved in data preprocessing, namely, data
cleaning, data integration, data reduction, and data transformation.
Data cleaning routines work to “clean” the data by filling in missing values, smooth-
ing noisy data, identifying or removing outliers, and resolving inconsistencies. If users
believe the data are dirty, they are unlikely to trust the results of any data mining that has
been applied. Furthermore, dirty data can cause confusion for the mining procedure,
resulting in unreliable output. Although most mining routines have some procedures
for dealing with incomplete or noisy data, they are not always robust. Instead, they may
concentrate on avoiding overfitting the data to the function being modeled. Therefore,
a useful preprocessing step is to run your data through some data cleaning routines.
Section 3.2 discusses methods for data cleaning.
Getting back to your task at AllElectronics , suppose that you would like to include
data from multiple sources in your analysis. This would involve integrating multiple
databases, data cubes, or files (i.e., data integration ). Yet some attributes representing a
 
Search WWH ::




Custom Search