acquiring knowledge, e.g., by rule induction or tree generation from complete
data sets. In this strategy, conversion of incomplete data sets to complete data
sets is a preprocessing step for the main process of data mining. In the latter strategy,
knowledge is acquired from incomplete data sets taking into account that
some attribute values are missing. The original data sets are not converted
into complete data sets.
Typical examples of the former strategy include [4, 11]:
• Replacing missing attribute values by the most common (most frequent)
  value of the attribute.
• Replacing missing attribute values restricted to the concept. For each
  concept, missing attribute values are replaced by the most common attribute
  value restricted to that concept.
• For numerical attributes, a missing attribute value may be replaced by the
  attribute average value.
• For numerical attributes, a missing attribute value may be replaced by the
  attribute average value restricted to the concept.
• Assigning all possible values of the attribute. A case with a missing
  attribute value is replaced by a set of new cases, in which the missing
  attribute value is replaced by all possible values of the attribute.
• Assigning all possible values of the attribute restricted to the concept.
• Ignoring cases with missing attribute values. An original data set, with
  missing attribute values, is replaced by a new data set from which the cases
  containing missing attribute values have been deleted.
• Considering missing attribute values as special values.
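Several of the preprocessing methods listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the cited sources: the "?" marker for a missing value, the function names, and the small table with attributes temperature and headache and decision flu are all assumptions made for the example.

```python
from collections import Counter

MISSING = "?"  # assumed marker for a missing attribute value


def most_common(values):
    """Most frequent known value (raises IndexError if no value is known)."""
    known = [v for v in values if v != MISSING]
    return Counter(known).most_common(1)[0][0]


def fill_most_common(cases, attribute, decision=None):
    """Replace missing values of `attribute` by its most common value.

    If `decision` is given, the most common value is computed only among
    cases of the same concept (cases sharing the same decision value).
    """
    filled = []
    for case in cases:
        if case[attribute] == MISSING:
            pool = [c[attribute] for c in cases
                    if decision is None or c[decision] == case[decision]]
            case = {**case, attribute: most_common(pool)}
        filled.append(case)
    return filled


def expand_all_values(cases, attribute):
    """Replace each case missing `attribute` by one new case per known value."""
    domain = sorted({c[attribute] for c in cases if c[attribute] != MISSING})
    expanded = []
    for case in cases:
        if case[attribute] == MISSING:
            expanded.extend({**case, attribute: v} for v in domain)
        else:
            expanded.append(case)
    return expanded


def drop_incomplete(cases):
    """Ignore (delete) every case that has at least one missing value."""
    return [c for c in cases if MISSING not in c.values()]


# Hypothetical decision table; "flu" plays the role of the decision.
cases = [
    {"temperature": "high",   "headache": "yes", "flu": "yes"},
    {"temperature": "?",      "headache": "yes", "flu": "yes"},
    {"temperature": "normal", "headache": "no",  "flu": "no"},
    {"temperature": "normal", "headache": "?",   "flu": "no"},
    {"temperature": "normal", "headache": "yes", "flu": "no"},
]

globally_filled = fill_most_common(cases, "temperature")
concept_filled = fill_most_common(cases, "temperature", decision="flu")
expanded = expand_all_values(cases, "temperature")
complete_only = drop_incomplete(cases)
```

In this table the second case receives the value normal when the whole attribute is considered, but high when the replacement is restricted to the concept flu = yes, which shows how the two methods can disagree.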
The latter strategy is exemplified by the C4.5 approach to missing attribute
values [21] or by a modified LEM2 algorithm [9, 13]. In both algorithms, original
data sets with missing attribute values are not preprocessed, i.e., data sets are
not preliminarily converted into complete data sets.
Note that from the viewpoint of rough set theory, in the former strategy
the conventional indiscernibility relation may be applied to describe the
process of data mining since, after preprocessing, the data set is complete (has
no missing attribute values). Furthermore, the lower and upper approximations,
the other basic ideas of rough set theory, are also conventional.
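For a complete data set, the conventional notions mentioned here can be illustrated as follows. The sketch assumes the standard definitions: two cases are indiscernible when they agree on all chosen attributes, the lower approximation of a concept is the union of the indiscernibility classes contained in it, and the upper approximation is the union of the classes that intersect it. The function names and the example table are hypothetical.

```python
from collections import defaultdict


def indiscernibility_classes(cases, attributes):
    """Group case indices that agree on every attribute in `attributes`."""
    blocks = defaultdict(set)
    for i, case in enumerate(cases):
        blocks[tuple(case[a] for a in attributes)].add(i)
    return list(blocks.values())


def approximations(cases, attributes, concept):
    """Return (lower, upper) approximations of `concept` (a set of indices)."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(cases, attributes):
        if block <= concept:   # class lies entirely inside the concept
            lower |= block
        if block & concept:    # class shares at least one case with it
            upper |= block
    return lower, upper


# Hypothetical complete decision table; "flu" is the decision.
table = [
    {"temperature": "high",   "headache": "yes", "flu": "yes"},
    {"temperature": "high",   "headache": "yes", "flu": "no"},
    {"temperature": "normal", "headache": "no",  "flu": "no"},
    {"temperature": "high",   "headache": "no",  "flu": "yes"},
]

flu_yes = {i for i, case in enumerate(table) if case["flu"] == "yes"}
lower, upper = approximations(table, ["temperature", "headache"], flu_yes)
```

Cases 0 and 1 are indiscernible but have different decisions, so case 0 belongs to the upper approximation of the concept flu = yes but not to its lower approximation.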
In this chapter we will concentrate on the latter strategy used for rule
induction, i.e., we will assume that the rule sets are induced from the original
data sets, with missing attribute values, not preprocessed as in the former
strategy.
We will assume that there are three reasons for decision tables to be
incomplete. The first reason is that an attribute value, for a specific case, is
lost. For example, the attribute value was originally known but, for a
variety of reasons, is currently not available. Maybe it was recorded
but later erased. The second possibility is that an attribute value was
not relevant: the case was decided to be a member of some concept, i.e., was
classified, or diagnosed, in spite of the fact that some attribute values were not