knowledge. A pattern that is not new may not be interesting. For instance, a
pattern according to which car accidents occur only among people over 18 years
of age is not surprising, since the user may have
already expected this. 15 Whether a pattern is already known to other people does
not matter; what matters is that the pattern is new to the user.
A pattern is useful when it may help in achieving the user's goals. A pattern
that does not contribute to achieving those goals may not be interesting. For
instance, a pattern that indicates groups of people who buy many books is of no
interest to a user who wants to sell CDs. Usefulness may be divided into an
efficacy component and an efficiency component. Efficacy is an indication of the
extent to which the knowledge contributes to achieving a goal or the extent to
which the goal is achieved. Efficiency is an indication of the speed or ease
with which the goal is achieved.
Non-triviality depends on the user's means. The user's means have to be
proportional to non-triviality: a pattern that is too trivial to compute, such as an
average, may not be interesting. On the other hand, when the user's means are too
limited to interpret the discovered pattern, it may also be difficult to speak of
'knowledge'. Looking at Figure 1.1 again, where the KDD process is illustrated,
may clarify this, as a certain insight is required for Step 4, in which the results of
data mining are interpreted.
Certainty
The second criterion for knowledge, certainty, depends on many factors. The most
important among them are the integrity of the data, the size of the sample, and the
significance of the calculated results. The integrity of the data concerns corrupted and
missing data. When only corrupted data are dealt with, the terms accuracy or
correctness are used. 16 When only missing data are dealt with, the term completeness
is used. Integrity may refer to both the accuracy and the completeness of data. 17
Missing data may leave blank spaces in the database, but they may also be made
up, especially in database systems that do not allow blank spaces. For instance, the
birthdays of people in databases tend to be (more often than may be expected) on
the 1st of January, because 1-1 is easiest to type. 18 Sometimes, a more serious
effort is made to construct the values of missing data. 19
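The completeness of a field and the over-representation of a made-up default value can both be checked with simple counts. The following Python snippet is a minimal sketch under stated assumptions: the function names, the sample birthdays, and the use of None to mark a blank space are all hypothetical and serve only as illustration.

```python
from datetime import date

def completeness(values):
    """Fraction of entries that are present (not None)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)

def january_first_share(birthdays):
    """Fraction of the known birthdays that fall on 1 January."""
    known = [b for b in birthdays if b is not None]
    if not known:
        return 0.0
    return sum(b.month == 1 and b.day == 1 for b in known) / len(known)

# Hypothetical records; None stands for a blank space in the database.
birthdays = [
    date(1970, 1, 1), date(1982, 1, 1), None, date(1990, 6, 15),
    date(1975, 1, 1), date(1988, 11, 3), None, date(1969, 3, 27),
]

# If birthdays were spread evenly over the year, roughly 1 in 365
# would fall on any given day.
EXPECTED_SHARE = 1 / 365.25

print("completeness:", completeness(birthdays))
print("share on 1 January:", january_first_share(birthdays))
print("expected share:", round(EXPECTED_SHARE, 4))
```

A share on 1 January far above the expected share suggests, but does not prove, that some of these values were entered in place of missing data.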
The sample size is a second important factor influencing certainty. However,
the number of samples that needs to be taken may be difficult to determine. In
general, the larger the sample size, the more certain the results. Minimum sample
sizes for acceptable reliability may be about 300 data items. These and larger
samples, sometimes running up to many thousands of data items, used to be
problematic for statistical research, but current databases are usually large enough
to provide sufficient samples. 20
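One way to see how certainty grows with sample size is to look at the margin of error of an estimated proportion. The Python snippet below is a rough sketch, not the method by which the figure of 300 was derived: it assumes a simple random sample, the normal approximation, and a 95% confidence level (z = 1.96), and the function name margin_of_error is hypothetical.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate half-width of the confidence interval for a proportion.

    n: sample size; p: assumed true proportion (0.5 is the worst case);
    z: normal quantile for the chosen confidence level (1.96 for 95%).
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (30, 300, 3000, 30000):
    print(f"n = {n:>5}: margin of error about {margin_of_error(n):.3f}")
```

Under these assumptions, a sample of 300 items gives a margin of error of a little under 6 percentage points, and each tenfold increase in sample size shrinks the margin by a factor of about 3.2 (the square root of 10).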
15 In Europe, driving licenses may generally be obtained from the age of 18.
16 Berti, L., and Graveleau, D. (1998).
17 Stallings, W. (1999).
18 Denning, D.E. (1983).
19 Holsheimer, M., and Siebes, A. (1991).
20 Hand, D.J. (1998).