2.3 Partitioning the Training Data Set
As we discussed in Sect. 1, special care has to be taken to account for the
skewness of the database. To this end, we propose to apply the ARM algorithm
to certain partitions of D TR . The partitions are constructed based on the class
labels of the training data instances that have been pre-classified. A separate
partition is created for each class label, irrespective of whether the class label
is a singleton or a composite proposition from Θ_C. Thus, we enumerate the
'newly created' class labels as C^(k), k = 1, ..., N_TC, where
|Θ_C| ≤ N_TC ≤ 2^{|Θ_C|}.
Note that N_TC attains its upper bound when the class labels of the training
data set span all possible subsets from Θ_C.
Denoting each partition by P^(k), k = 1, ..., N_TC, the training data set can be
represented as the union of the partitions, viz.,

D_TR = ⋃_{k=1}^{N_TC} P^(k).    (11)
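The partitioning in Eq. (11) can be sketched in Python. The representation of a training instance as a (features, label) pair, and the use of frozensets to stand for (possibly composite) class labels from Θ_C, are illustrative assumptions rather than part of the original method.

```python
# Sketch of Eq. (11): one partition of D_TR per distinct class label,
# whether the label is a singleton or a composite subset of Theta_C.
from collections import defaultdict

def partition_training_set(d_tr):
    """Group pre-classified instances (features, label) by class label.

    Composite labels are represented as frozensets, so an ambiguous
    label such as {OfConcern, Dangerous} forms its own partition key.
    """
    partitions = defaultdict(list)
    for features, label in d_tr:
        partitions[frozenset(label)].append((features, label))
    return dict(partitions)
```

Because every instance carries exactly one label, the resulting partitions are mutually exclusive by construction and their union recovers D_TR.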
It is clear that the partitions are mutually exclusive, i.e., P^(k₁) ∩ P^(k₂) = ∅
whenever k₁ ≠ k₂.
Recall the example in Sect. 1. Suppose certain training data instances have
been classified as (OfConcern, Dangerous) due to the conflicting opinions of the
experts. Thus, the training data set could be subdivided into five partitions,
{P^(1), P^(2), P^(3), P^(4), P^(5)}, where the first four partitions would contain the
training data instances with labels NotDangerous, OfConcern, Dangerous and
ExtremelyDangerous, respectively, and P^(5) would correspond to the ambigu-
ous class label (OfConcern, Dangerous).
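The five-partition example above can be reconstructed concretely. The feature values and the representation of labels as frozensets are illustrative assumptions; only the five class labels come from the text.

```python
# Illustrative reconstruction of the five-partition example; the
# composite label {OfConcern, Dangerous} forms its own partition.
d_tr = [
    ({"speed": "low"},  frozenset({"NotDangerous"})),
    ({"speed": "med"},  frozenset({"OfConcern"})),
    ({"speed": "high"}, frozenset({"Dangerous"})),
    ({"speed": "max"},  frozenset({"ExtremelyDangerous"})),
    # Conflicting expert opinions yield the ambiguous composite label:
    ({"speed": "high"}, frozenset({"OfConcern", "Dangerous"})),
]

labels = {label for _, label in d_tr}
partitions = {lab: [inst for inst in d_tr if inst[1] == lab] for lab in labels}
```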
2.4 Partitioned-ARM
The ARM algorithm generates rules r_i of the form X ⇒ Y, where the an-
tecedent is X ⊆ Θ_F and the consequent is Y ⊆ Θ_C. The 'quality' of a rule is
characterized by the support and confidence measures. To achieve an approx-
imately equal number of rules inside each partition, we modify the support
measure as

support = Count((X ⇒ Y), P^(k)) / |P^(k)|,    (12)
i.e., we calculate the support for a rule based on the partition. This is in
contrast to the usual practice of calculating it based on the whole database.
Here, Count((X ⇒ Y), P^(k)) is the number of data instances <X, Y> inside
the partition P^(k). We define the confidence of the rule using

confidence = Count((X ⇒ Y), P^(k)) / Σ_{Z⊆Θ_C} Count((X ⇒ Z), D_TR).    (13)
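The partition-based measures in Eqs. (12) and (13) can be sketched as follows. The instance representation (a frozenset of feature items paired with a frozenset label) and the `count` helper are assumptions made for illustration.

```python
# Sketch of the partition-based support (Eq. 12) and confidence (Eq. 13).
def count(antecedent, consequent, data):
    """Number of instances <X, Y> in `data` whose features contain the
    antecedent X and whose label equals the consequent Y."""
    return sum(1 for x, y in data
               if antecedent.issubset(x) and y == consequent)

def support(antecedent, consequent, partition):
    # Eq. (12): support is computed within the partition P(k),
    # not over the whole database D_TR.
    return count(antecedent, consequent, partition) / len(partition)

def confidence(antecedent, consequent, partition, d_tr, theta_c_subsets):
    # Eq. (13): the numerator counts (X => Y) inside P(k); the denominator
    # sums counts of (X => Z) over all subsets Z of Theta_C, taken over D_TR.
    denom = sum(count(antecedent, z, d_tr) for z in theta_c_subsets)
    return count(antecedent, consequent, partition) / denom
```

Normalizing support by |P^(k)| rather than |D_TR| is what equalizes rule counts across partitions of very different sizes, which is the point of the modification.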