Introduction - Data Mining: Concepts and Techniques


events can be more interesting than the more regularly occurring ones. The analysis of

outlier data is referred to as outlier analysis or anomaly mining.

Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures, where objects that are remote from any other cluster are considered outliers. Density-based methods, by contrast, may identify objects that are outliers within a local region even though they look normal from a global statistical distribution view.
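The first two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample `readings`, and the `threshold`, `k`, and `radius` parameters are hypothetical and do not come from the text, and the z-score test assumes roughly normally distributed data. Density-based methods are not shown.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Statistical approach: flag values whose z-score exceeds the threshold.

    Assumes the data are roughly normally distributed (an assumption of
    this sketch, not a claim about any particular data set).
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]

def distance_outliers(values, k=2, radius=10.0):
    """Distance-based approach: flag values with fewer than k neighbors
    within the given radius."""
    outliers = []
    for i, v in enumerate(values):
        neighbors = sum(
            1 for j, w in enumerate(values)
            if j != i and abs(v - w) <= radius
        )
        if neighbors < k:
            outliers.append(v)
    return outliers

# Hypothetical one-dimensional data with one remote value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]

print(zscore_outliers(readings, threshold=2.0))
print(distance_outliers(readings, k=2, radius=10.0))
```

Both functions flag only the value 95 on this sample. The threshold of 2.0 is deliberately low: with only nine points, a sample z-score cannot exceed about 2.67, so the customary cutoff of 3 would never fire.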

Example 1.10 Outlier analysis. Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of unusually large amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values may also be detected with respect to the locations and types of purchase, or the purchase frequency.
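A minimal sketch of the per-account comparison in Example 1.10 might look as follows. The account identifiers, the amounts, and the rule "more than `factor` times the account's median charge" are all assumptions made for illustration; the text does not prescribe a specific test.

```python
from collections import defaultdict
from statistics import median

def flag_large_charges(transactions, factor=5.0):
    """Flag charges that exceed `factor` times the median charge
    of the same account (a hypothetical rule for this sketch)."""
    by_account = defaultdict(list)
    for account, amount in transactions:
        by_account[account].append(amount)

    flagged = []
    for account, amount in transactions:
        typical = median(by_account[account])
        if typical > 0 and amount > factor * typical:
            flagged.append((account, amount))
    return flagged

# Hypothetical (account, amount) pairs.
transactions = [
    ("A-001", 25.0), ("A-001", 40.0), ("A-001", 32.0), ("A-001", 1800.0),
    ("B-002", 900.0), ("B-002", 1100.0), ("B-002", 1000.0),
]

print(flag_large_charges(transactions))
```

Only the 1800.0 charge on account A-001 is flagged. Charges of around 1000 are routine for account B-002, which is why the comparison is made per account rather than against all charges together.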

Outlier analysis is discussed in Chapter 12.

1.4.6 Are All Patterns Interesting?

A data mining system has the potential to generate thousands or even millions of

patterns, or rules.

You may ask, “Are all of the patterns interesting?” Typically, the answer is no—only

a small fraction of the patterns potentially generated would actually be of interest to a

given user.

This raises some serious questions for data mining. You may wonder, “What makes a

pattern interesting? Can a data mining system generate all of the interesting patterns? Or, can the system generate only the interesting ones?”

To answer the first question, a pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data with some degree of certainty, (3) potentially useful, and (4) novel. A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern represents knowledge.

Several objective measures of pattern interestingness exist. These are based on the structure of discovered patterns and the statistics underlying them. An objective measure for association rules of the form $X \Rightarrow Y$ is rule support, representing the percentage of transactions from a transaction database that the given rule satisfies. This is taken to be the probability $P(X \cup Y)$, where $X \cup Y$ indicates that a transaction contains both $X$ and $Y$, that is, the union of itemsets $X$ and $Y$. Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association. This is taken to be the conditional probability $P(Y \mid X)$, that is, the probability that a transaction containing $X$ also contains $Y$. More formally, support and confidence are defined as

$$\text{support}(X \Rightarrow Y) = P(X \cup Y),$$

$$\text{confidence}(X \Rightarrow Y) = P(Y \mid X).$$
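The two definitions can be illustrated with a short Python sketch. The function names and the four sample transactions are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def support(transactions, X, Y):
    """Fraction of transactions containing every item of X and of Y,
    i.e. an estimate of P(X ∪ Y)."""
    both = set(X) | set(Y)
    return sum(1 for t in transactions if both <= set(t)) / len(transactions)

def confidence(transactions, X, Y):
    """Fraction of the transactions containing X that also contain Y,
    i.e. an estimate of P(Y | X)."""
    x = set(X)
    both = x | set(Y)
    n_x = sum(1 for t in transactions if x <= set(t))
    if n_x == 0:
        return 0.0  # X never occurs, so the rule has no evidence
    n_both = sum(1 for t in transactions if both <= set(t))
    return n_both / n_x

# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

rule_support = support(transactions, {"computer"}, {"software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)
```

For the rule computer ⇒ software, two of the four transactions contain both items, giving a support of 0.5, and two of the three transactions containing computer also contain software, giving a confidence of 2/3.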

In general, each interestingness measure is associated with a threshold, which may be

controlled by the user. For example, rules that do not satisfy a confidence threshold of,
