Genetic algorithms are easily parallelizable and have been used for classification as
well as other optimization problems. In data mining, they may be used to evaluate the
fitness of other algorithms.
9.6.2 Rough Set Approach
Rough set theory can be used for classification to discover structural relationships within
imprecise or noisy data. It applies to discrete-valued attributes; continuous-valued
attributes must therefore be discretized before the approach can be applied.
Rough set theory is based on the establishment of equivalence classes within the
given training data. All the data tuples forming an equivalence class are indiscernible,
that is, the samples are identical with respect to the attributes describing the data. Given
real-world data, it is common that some classes cannot be distinguished in terms of the
available attributes. Rough sets can be used to approximately or “roughly” define such
classes. A rough set definition for a given class, C, is approximated by two sets: a lower
approximation of C and an upper approximation of C. The lower approximation of C
consists of all the data tuples that, based on the knowledge of the attributes, are certain to
belong to C without ambiguity. The upper approximation of C consists of all the tuples
that, based on the knowledge of the attributes, cannot be described as not belonging to
C. The lower and upper approximations for a class C are shown in Figure 9.14, where
each rectangular region represents an equivalence class. Decision rules can be generated
for each class. Typically, a decision table is used to represent the rules.
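To make these definitions concrete, the following is a minimal Python sketch that partitions a toy relation into equivalence classes and computes the lower and upper approximations of a class C. The attribute names, the data, and the helper functions are hypothetical illustrations of the definitions above, not standard rough set software.

from collections import defaultdict

def equivalence_classes(tuples, attrs):
    # Group tuple indices that are indiscernible with respect to `attrs`.
    classes = defaultdict(set)
    for i, t in enumerate(tuples):
        classes[tuple(t[a] for a in attrs)].add(i)
    return list(classes.values())

def rough_approximation(tuples, attrs, target):
    # Return the (lower, upper) approximations of `target`, a set of tuple indices.
    lower, upper = set(), set()
    for eq in equivalence_classes(tuples, attrs):
        if eq <= target:      # class lies entirely inside C: certain members
            lower |= eq
        if eq & target:       # class overlaps C: possible members
            upper |= eq
    return lower, upper

# Hypothetical toy relation of discrete-valued tuples.
data = [
    {"age": "young", "income": "high", "buys": "yes"},
    {"age": "young", "income": "high", "buys": "no"},   # indiscernible from tuple 0 on age/income
    {"age": "old",   "income": "low",  "buys": "no"},
    {"age": "old",   "income": "high", "buys": "yes"},
]
C = {i for i, t in enumerate(data) if t["buys"] == "yes"}
lower, upper = rough_approximation(data, ["age", "income"], C)
print(lower)   # {3}        tuples certain to belong to C
print(upper)   # {0, 1, 3}  tuples that cannot be ruled out of C

Tuples 0 and 1 share the same age and income but carry different class labels, so they fall into the upper approximation only; this is exactly the kind of indiscernibility that the lower/upper split captures.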
Rough sets can also be used for attribute subset selection (or feature reduction, where
attributes that do not contribute to the classification of the given training data can be
identified and removed) and relevance analysis (where the contribution or significance
of each attribute is assessed with respect to the classification task). The problem of finding
the minimal subsets (reducts) of attributes that can describe all the concepts in
the given data set is NP-hard. However, algorithms to reduce the computation intensity
have been proposed. In one method, for example, a discernibility matrix is used that
stores the differences between attribute values for each pair of data tuples. Rather than
searching on the entire training set, the matrix is instead searched to detect redundant attributes.
Figure 9.14 A rough set approximation of class C's set of tuples using lower and upper approximation sets of C. The rectangular regions represent equivalence classes.
 