Using Association Rules for Classification from Databases Having Class Label Ambiguities: A Belief Theoretic Method - Data Mining: Foundations and Practice

• Class label ambiguities in training databases;
• Computational and storage constraints; and
• Skewness of the databases.

Class label ambiguities arise naturally in application scenarios, especially when domain expert knowledge is sought for classifying the training data instances. We use a belief theoretic technique to address this issue. It enables the proposed ARM-KNN-BF classifier to conveniently model the class label ambiguities. Each generated rule is then treated as a BoE providing another 'piece of evidence' for classifying an incoming data instance. The final classification result is based upon the fused BoE generated by DRC. Skewness of the training data set can also create significant difficulties in ARM because the majority classes tend to overwhelm the minority classes in such situations. The partitioned-ARM strategy we employ creates an approximately equal number of rules for each class label, thus solving this problem. Using rules generated from only the nearest neighbors (instead of the complete rule set) means that significantly fewer rules enter the BoE combination stage, which makes our classifier more computationally efficient. Applications where these issues are of critical importance include threat detection and assessment scenarios.
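The fusion step described above can be illustrated with a minimal sketch. It assumes that DRC denotes Dempster's rule of combination and that each BoE is represented as a mapping from sets of class labels (focal elements) to masses. The function name `combine_dempster`, the three-label frame, and the mass values are hypothetical and chosen only for illustration; they are not taken from the chapter.

```python
from itertools import product


def combine_dempster(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination.

    Each mass function maps a frozenset of class labels to a mass;
    the masses of one function are assumed to sum to 1.
    """
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            # Agreeing evidence supports the intersection of the two sets.
            fused[common] = fused.get(common, 0.0) + wa * wb
        else:
            # Disjoint focal elements contribute to the conflict mass.
            conflict += wa * wb
    if conflict >= 1.0 - 1e-12:
        raise ValueError("the two bodies of evidence are in total conflict")
    norm = 1.0 - conflict
    return {focal: mass / norm for focal, mass in fused.items()}


# Hypothetical frame of discernment for a threat-level problem.
FRAME = frozenset({"low", "medium", "high"})

# Two illustrative rules; the first carries an ambiguous label {medium, high}.
rule_a = {frozenset({"high"}): 0.6, frozenset({"medium", "high"}): 0.3, FRAME: 0.1}
rule_b = {frozenset({"medium"}): 0.5, FRAME: 0.5}

fused = combine_dempster(rule_a, rule_b)
for focal, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(focal), round(mass, 4))
```

Because an ambiguous label such as {medium, high} is simply another focal element, it passes through the combination without being forced into a single class. Folding `combine_dempster` over only the rules selected from the nearest neighbors keeps the number of pairwise products small, which is the computational saving the text refers to.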

As opposed to other classifiers (such as C4.5 and KNN), belief theoretic classifiers capture much richer information content in the decision making stage. Furthermore, neighbors are defined differently in the ARM-KNN-BF classifier than in the KNN-BF and KNN classifiers. Because the rules in the ARM-KNN-BF classifier are generated via ARM, they capture the associations within the training data instances. The classifier is thus able to overcome 'noise' effects that could be induced by individual data instances, which results in better decisions. Of course, a much smaller rule set in the classification stage significantly reduces the storage and computational requirements, a factor that plays a major role when working with huge databases.
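One way to see the richer information content of a belief theoretic decision is through the standard Dempster-Shafer belief and plausibility of each class, which bound the support for that class from below and above. The sketch below assumes a BoE stored as a mapping from sets of class labels to masses; the names `belief` and `plausibility` and the example masses are hypothetical, and the chapter excerpt does not state which of these quantities the ARM-KNN-BF classifier actually reports.

```python
def belief(mass, hypothesis):
    """Total mass of focal elements contained in the hypothesis."""
    return sum(w for focal, w in mass.items() if focal <= hypothesis)


def plausibility(mass, hypothesis):
    """Total mass of focal elements that intersect the hypothesis."""
    return sum(w for focal, w in mass.items() if focal & hypothesis)


# Illustrative fused BoE over a hypothetical three-label frame.
boe = {
    frozenset({"high"}): 0.5,
    frozenset({"medium", "high"}): 0.3,
    frozenset({"low", "medium", "high"}): 0.2,
}

for label in ("low", "medium", "high"):
    h = frozenset({label})
    print(label, round(belief(boe, h), 4), round(plausibility(boe, h), 4))
```

A classifier that outputs a single label would report only "high" here. The interval between belief and plausibility additionally shows how much of the evidence is committed to a class and how much remains ambiguous.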

The work described above opens up several interesting research issues that warrant further study. In security monitoring and threat classification, it is essential that one errs on the side of caution. In other words, it is always better to overestimate the threat level than to underestimate it. So, development of strategies that favor overestimating the threat level over underestimating it is warranted.

Another important research problem involves the extension of this work to accommodate more general types of imperfections in both class labels and features. The work described herein handles ambiguities in class labels only; ways to handle general belief theoretic class label imperfections [28] would be extremely useful. Development of strategies that can address general belief theoretic imperfections in features would further enhance the applicability of this work. Some initial work along this line appears in [11, 12].
