2.3 Partitioning the Training Data Set
As we discussed in Sect. 1, special care has to be taken to account for the
skewness of the database. To this end, we propose to apply the ARM algorithm
to certain partitions of D TR . The partitions are constructed based on the class
labels of the training data instances that have been pre-classified. A separate
partition is created for each class label, irrespective of whether the class label
is a singleton or a composite proposition from Θ_C. Thus, we enumerate the
'newly created' class labels as C^(k), k = 1, ..., N_TC, where
|Θ_C| ≤ N_TC ≤ 2^{|Θ_C|}.
Note that N_TC attains its upper bound when the class labels of the training
data set span all possible subsets from Θ_C.
Denoting each partition by P^(k), k = 1, ..., N_TC, the training data set can be
represented as the union of the partitions, viz.,

D_TR = ⋃_{k=1}^{N_TC} P^(k).    (11)
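The partitioning in Eq. (11) can be sketched in Python. The representation of a training instance as a (features, label) pair, and the use of frozensets to stand for (possibly composite) class labels from Θ_C, are illustrative assumptions rather than part of the original method.

```python
# Sketch of Eq. (11): one partition of D_TR per distinct class label,
# whether the label is a singleton or a composite subset of Theta_C.
from collections import defaultdict

def partition_training_set(d_tr):
    """Group pre-classified instances (features, label) by class label.

    Composite labels are represented as frozensets, so an ambiguous
    label such as {OfConcern, Dangerous} forms its own partition key.
    """
    partitions = defaultdict(list)
    for features, label in d_tr:
        partitions[frozenset(label)].append((features, label))
    return dict(partitions)
```

Because every instance carries exactly one label, the resulting partitions are mutually exclusive by construction and their union recovers D_TR.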
It is clear that the partitions are mutually exclusive, i.e., P^(k₁) ∩ P^(k₂) = ∅
whenever k₁ ≠ k₂.
Recall the example in Sect. 1. Suppose certain training data instances have
been classified as (OfConcern, Dangerous) due to the conflicting opinions of the
experts. Thus, the training data set could be subdivided into five partitions,
{P^(1), P^(2), P^(3), P^(4), P^(5)}, where the first four partitions would contain the
training data instances with labels NotDangerous, OfConcern, Dangerous and
ExtremelyDangerous, respectively, and P^(5) would correspond to the ambigu-
ous class label (OfConcern, Dangerous).
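The five-partition example above can be reconstructed concretely. The feature values and the representation of labels as frozensets are illustrative assumptions; only the five class labels come from the text.

```python
# Illustrative reconstruction of the five-partition example; the
# composite label {OfConcern, Dangerous} forms its own partition.
d_tr = [
    ({"speed": "low"},  frozenset({"NotDangerous"})),
    ({"speed": "med"},  frozenset({"OfConcern"})),
    ({"speed": "high"}, frozenset({"Dangerous"})),
    ({"speed": "max"},  frozenset({"ExtremelyDangerous"})),
    # Conflicting expert opinions yield the ambiguous composite label:
    ({"speed": "high"}, frozenset({"OfConcern", "Dangerous"})),
]

labels = {label for _, label in d_tr}
partitions = {lab: [inst for inst in d_tr if inst[1] == lab] for lab in labels}
```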
2.4 Partitioned-ARM
The ARM algorithm generates rules r_i of the form X ⇒ Y, where the an-
tecedent is X ⊆ Θ_F and the consequent is Y ⊆ Θ_C. The 'quality' of a rule is
characterized by the support and confidence measures. To achieve an approx-
imately equal number of rules inside each partition, we modify the support
measure as

support = Count((X ⇒ Y), P^(k)) / |P^(k)|,    (12)
i.e., we calculate the support for a rule based on the partition. This is in
contrast to the usual practice of calculating it based on the whole database.
Here, Count((X ⇒ Y), P^(k)) is the number of data instances <X, Y> inside
the partition P^(k). We define the confidence of the rule using

confidence = Count((X ⇒ Y), P^(k)) / Σ_{Z⊆Θ_C} Count((X ⇒ Z), D_TR).    (13)
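The partition-based measures in Eqs. (12) and (13) can be sketched as follows. The instance representation (a frozenset of feature items paired with a frozenset label) and the `count` helper are assumptions made for illustration.

```python
# Sketch of the partition-based support (Eq. 12) and confidence (Eq. 13).
def count(antecedent, consequent, data):
    """Number of instances <X, Y> in `data` whose features contain the
    antecedent X and whose label equals the consequent Y."""
    return sum(1 for x, y in data
               if antecedent.issubset(x) and y == consequent)

def support(antecedent, consequent, partition):
    # Eq. (12): support is computed within the partition P(k),
    # not over the whole database D_TR.
    return count(antecedent, consequent, partition) / len(partition)

def confidence(antecedent, consequent, partition, d_tr, theta_c_subsets):
    # Eq. (13): the numerator counts (X => Y) inside P(k); the denominator
    # sums counts of (X => Z) over all subsets Z of Theta_C, taken over D_TR.
    denom = sum(count(antecedent, z, d_tr) for z in theta_c_subsets)
    return count(antecedent, consequent, partition) / denom
```

Normalizing support by |P^(k)| rather than |D_TR| is what equalizes rule counts across partitions of very different sizes, which is the point of the modification.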