contains a set of items I, where each T_i ∈ T satisfies T_i ⊆ I. In ARM, two threshold values are
usually used to determine the significance of an AR:
Support: The frequency with which the items occur or co-occur in T. A support
threshold σ, defined by the user, is used to distinguish frequent items from
the infrequent ones. A set of items S is called an itemset, where S ⊆ T_i ⊆ I,
and the items a_i ∈ S co-occur in T_i. If the number of occurrences of some S in T exceeds σ,
we say that S is a Frequent Itemset (FI).
Confidence: Represents how “strongly” an itemset X implies another itemset Y,
where X, Y ⊆ I and X ∩ Y = {}. A confidence threshold α,
supplied by the user, is used to distinguish high confidence ARs from low
confidence ARs.
An AR X ⇒ Y is valid when the support for the co-occurrence of X and
Y exceeds σ, and the confidence of this AR exceeds α. The computation of
support is count(X ∪ Y) / |T|, where |T| is the size of the set T and
count(X ∪ Y) is the number of transactions in which X and Y co-occur. The computation
of confidence is Support(X ∪ Y) / Support(X). Informally, X ⇒ Y can be interpreted as “if
X exists, it is likely that Y also exists”. With regard to the history of ARM
investigation, three major categories of serial (non-parallel) ARM algorithms
can be identified: (1) mining ARs from all possible FIs, (2) mining ARs from
Maximal Frequent Itemsets (MFIs), and (3) mining ARs from Frequent Closed
Itemsets (FCIs).
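The support and confidence computations described above can be sketched as follows. This is an illustrative sketch only: the transaction database T, the item names, and the example AR are invented, not taken from the text.

```python
# A minimal sketch of the support and confidence computations for an AR X => Y.
# The transaction database T and the item names are invented for illustration.

T = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    # Fraction of transactions in which all items of `itemset` co-occur
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    # Support(X | Y) / Support(X): how strongly X implies Y
    return support(X | Y, transactions) / support(X, transactions)

# For the AR {bread} => {milk}:
print(support({"bread", "milk"}, T))       # 0.5  (co-occur in 2 of 4 transactions)
print(confidence({"bread"}, {"milk"}, T))  # 0.5 / 0.75 = 0.6666...
```

A user-defined σ and α would then be compared against these two values to decide whether the AR is significant.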
Mining ARs from FIs
In the past decade, many algorithms have been introduced that mine ARs
from identified FIs. These algorithms can be further grouped into different
“families”, such as Pure-apriori like, Semi-apriori like, Set Enumeration Tree
like, etc.
Pure-apriori like where FIs are generated based on the level-by-level
generate-prune iteration that was first promulgated in the Apriori algo-
rithm [2]. In this “family” archetypal algorithms include: Apriori, AprioriTid
and AprioriHybrid [2], Partition [49], DHP [43], Sampling [50], DIC [7],
CARMA [31], etc.
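The generate-prune iteration that characterises this “family” can be sketched as follows. This is a minimal illustration under invented data and an invented support threshold, not the full Apriori algorithm of [2]:

```python
from itertools import combinations

# Minimal sketch of the Apriori generate-prune, level-by-level iteration.
# Transactions and the support threshold are invented for illustration.

def apriori(transactions, min_count):
    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent 1-itemsets
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_count]
    frequent = list(current)
    k = 2
    while current:
        # Generate: join frequent (k-1)-itemsets whose union has size k
        # (equivalently, pairs sharing k-2 items)
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: a candidate survives only if every (k-1)-subset is frequent
        # (the closure property: all subsets of a frequent itemset are frequent)
        prev = set(current)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count support and keep only candidates meeting the threshold
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) >= min_count]
        frequent.extend(current)
        k += 1
    return frequent

T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(sorted(tuple(sorted(s)) for s in apriori(T, min_count=3)))
# [('a',), ('a', 'b'), ('a', 'c'), ('b',), ('b', 'c'), ('c',)]
```

Here {a, b, c} is generated as a candidate at level 3 but fails the support count, so the iteration stops; the prune step is where the closure property of itemsets is exploited.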
Semi-apriori like where FIs are generated by enumerating candidate item-
sets, but without applying the Apriori generate-prune iterative approach
founded on (1) the join procedure, and (2) the prune procedure that em-
ploys the closure property of itemsets: if an itemset is frequent then all
its subsets will also be frequent; if an itemset is infrequent then all its su-
persets will also be infrequent. In this “family” typical algorithms include:
AIS [1], SETM [33], OCD [40], etc.
Set Enumeration Tree like where FIs are generated through constructing
a set enumeration tree structure [48] from D_T, which avoids the need
to enumerate a large number of candidate itemsets. In this “family” a
number of approaches can be further divided into two main streams: (1)