events can be more interesting than the more regularly occurring ones. The analysis of
outlier data is referred to as outlier analysis or anomaly mining.
Outliers may be detected using statistical tests that assume a distribution or probability
model for the data, or using distance measures where objects that are remote from
any other cluster are considered outliers. Rather than using statistical or distance
measures, density-based methods may identify objects that are outliers within a local
region even though they look normal from a global statistical distribution view.
Example 1.10 Outlier analysis. Outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of unusually large amounts for a given account number in comparison
to regular charges incurred by the same account. Outlier values may also be detected
with respect to the locations and types of purchase, or the purchase frequency.
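The statistical approach in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the account's regular charges are assumed to be roughly normally distributed, the function name `is_unusual_charge`, the sample `history`, and the z-score cutoff of 3 are all hypothetical choices, and a production fraud detector would use far richer models.

```python
from statistics import mean, stdev

def is_unusual_charge(history, amount, z_threshold=3.0):
    """Flag `amount` if it is unusually large relative to an account's
    regular charges (a simple z-score test; assumes roughly normal data)."""
    if len(history) < 2:
        raise ValueError("need at least two past charges to estimate spread")
    mu = mean(history)
    sigma = stdev(history)          # sample standard deviation
    if sigma == 0:
        # All past charges identical: treat any larger amount as unusual
        # (a convention chosen for this sketch).
        return amount > mu
    z = (amount - mu) / sigma       # how many deviations above the mean
    return z > z_threshold

# Hypothetical regular charges for one account.
history = [42.0, 18.5, 63.2, 25.0, 37.8, 51.3, 29.9]

print(is_unusual_charge(history, 2400.0))  # True: far above regular charges
print(is_unusual_charge(history, 55.0))    # False: within the usual range
```

The new purchase is compared against past charges only, rather than being included in the mean and standard deviation, because a single extreme value in a small sample inflates the spread and can hide itself from a z-score test.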
Outlier analysis is discussed in Chapter 12.
1.4.6 Are All Patterns Interesting?
A data mining system has the potential to generate thousands or even millions of
patterns, or rules.
You may ask, “Are all of the patterns interesting?” Typically, the answer is no—only
a small fraction of the patterns potentially generated would actually be of interest to a
given user.
This raises some serious questions for data mining. You may wonder, “What makes a
pattern interesting? Can a data mining system generate all of the interesting patterns? Or,
can the system generate only the interesting ones?”
To answer the first question, a pattern is interesting if it is (1) easily understood by
humans, (2) valid on new or test data with some degree of certainty, (3) potentially
useful, and (4) novel. A pattern is also interesting if it validates a hypothesis that the user
sought to confirm. An interesting pattern represents knowledge.
Several objective measures of pattern interestingness exist. These are based on
the structure of discovered patterns and the statistics underlying them. An objective
measure for association rules of the form $X \Rightarrow Y$ is rule support, representing the
percentage of transactions from a transaction database that the given rule satisfies. This is
taken to be the probability $P(X \cup Y)$, where $X \cup Y$ indicates that a transaction contains
both $X$ and $Y$, that is, the union of itemsets $X$ and $Y$. Another objective measure for
association rules is confidence, which assesses the degree of certainty of the detected
association. This is taken to be the conditional probability $P(Y \mid X)$, that is, the
probability that a transaction containing $X$ also contains $Y$. More formally, support and
confidence are defined as
$$\mathrm{support}(X \Rightarrow Y) = P(X \cup Y),$$
$$\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X).$$
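To make the two definitions concrete, the following minimal Python sketch estimates support and confidence by counting over a small transaction database. The function names, the toy grocery transactions, and the choice to return 0.0 when no transaction contains the antecedent are assumptions made for illustration; they are not taken from the text.

```python
def support(transactions, x, y):
    """Fraction of transactions that contain every item of X and of Y,
    i.e. an estimate of P(X u Y) in the notation above."""
    both = set(x) | set(y)
    hits = sum(1 for t in transactions if both <= t)
    return hits / len(transactions)

def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y,
    i.e. an estimate of P(Y | X)."""
    x = set(x)
    both = x | set(y)
    n_x = sum(1 for t in transactions if x <= t)
    if n_x == 0:
        return 0.0  # sketch convention: no transaction contains X
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_x

# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

# Rule {bread} => {milk}
print(support(transactions, {"bread"}, {"milk"}))     # 0.6  (3 of 5 transactions)
print(confidence(transactions, {"bread"}, {"milk"}))  # 0.75 (3 of 4 with bread)
```

Here three of the five transactions contain both bread and milk, so the rule's support is 0.6. Four transactions contain bread and three of those also contain milk, so its confidence is 0.75.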
In general, each interestingness measure is associated with a threshold, which may be
controlled by the user. For example, rules that do not satisfy a confidence threshold of,
 