Genetic algorithms are easily parallelizable and have been used for classification as
well as other optimization problems. In data mining, they may be used to evaluate the
fitness of other algorithms.
9.6.2 Rough Set Approach
Rough set theory can be used for classification to discover structural relationships within
imprecise or noisy data. It applies to discrete-valued attributes; continuous-valued
attributes must therefore be discretized before the approach can be applied.
Rough set theory is based on the establishment of equivalence classes within the
given training data. All the data tuples forming an equivalence class are indiscernible,
that is, the samples are identical with respect to the attributes describing the data. Given
real-world data, it is common that some classes cannot be distinguished in terms of the
available attributes. Rough sets can be used to approximately or “roughly” define such
classes. A rough set definition for a given class, C, is approximated by two sets: a lower
approximation of C and an upper approximation of C. The lower approximation of C
consists of all the data tuples that, based on the knowledge of the attributes, are certain to
belong to C without ambiguity. The upper approximation of C consists of all the tuples
that, based on the knowledge of the attributes, cannot be described as not belonging to
C. The lower and upper approximations for a class C are shown in Figure 9.14, where
each rectangular region represents an equivalence class. Decision rules can be generated
for each class. Typically, a decision table is used to represent the rules.
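To make these definitions concrete, the following is a minimal Python sketch that partitions a toy relation into equivalence classes and computes the lower and upper approximations of a class C. The attribute names, the data, and the helper functions are hypothetical illustrations of the definitions above, not standard rough set software.

from collections import defaultdict

def equivalence_classes(tuples, attrs):
    # Group tuple indices that are indiscernible with respect to `attrs`.
    classes = defaultdict(set)
    for i, t in enumerate(tuples):
        classes[tuple(t[a] for a in attrs)].add(i)
    return list(classes.values())

def rough_approximation(tuples, attrs, target):
    # Return the (lower, upper) approximations of `target`, a set of tuple indices.
    lower, upper = set(), set()
    for eq in equivalence_classes(tuples, attrs):
        if eq <= target:      # class lies entirely inside C: certain members
            lower |= eq
        if eq & target:       # class overlaps C: possible members
            upper |= eq
    return lower, upper

# Hypothetical toy relation of discrete-valued tuples.
data = [
    {"age": "young", "income": "high", "buys": "yes"},
    {"age": "young", "income": "high", "buys": "no"},   # indiscernible from tuple 0 on age/income
    {"age": "old",   "income": "low",  "buys": "no"},
    {"age": "old",   "income": "high", "buys": "yes"},
]
C = {i for i, t in enumerate(data) if t["buys"] == "yes"}
lower, upper = rough_approximation(data, ["age", "income"], C)
print(lower)   # {3}        tuples certain to belong to C
print(upper)   # {0, 1, 3}  tuples that cannot be ruled out of C

Tuples 0 and 1 share the same age and income but carry different class labels, so they fall into the upper approximation only; this is exactly the kind of indiscernibility that the lower/upper split captures.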
Rough sets can also be used for attribute subset selection (or feature reduction, where
attributes that do not contribute to the classification of the given training data can be
identified and removed) and relevance analysis (where the contribution or significance
of each attribute is assessed with respect to the classification task). The problem of finding
the minimal subsets (reducts) of attributes that can describe all the concepts in
the given data set is NP-hard. However, algorithms to reduce the computation intensity
have been proposed. In one method, for example, a discernibility matrix is used that
stores the differences between attribute values for each pair of data tuples. Rather than
searching on the entire training set, the matrix is instead searched to detect redundant attributes.
Figure 9.14 A rough set approximation of class C's set of tuples using lower and upper approximation sets of C. The rectangular regions represent equivalence classes.
 