Three Approaches to Missing Attribute Values: A Rough Set Perspective - Data Mining: Foundations and Practice


acquiring knowledge, e.g., by rule induction or tree generation from complete data sets. In this strategy, conversion of incomplete data sets to complete data sets is a preprocessing step for the main process of data mining. In the latter strategy, knowledge is acquired from incomplete data sets, taking into account that some attribute values are missing. The original data sets are not converted into complete data sets.

Typical examples of the former strategy include [4, 11]:

• Replacing missing attribute values by the most common (most frequent) value of the attribute.
• Replacing missing attribute values restricted to the concept. For each concept, missing attribute values are replaced by the most common attribute value restricted to that concept.
• For numerical attributes, a missing attribute value may be replaced by the attribute average value.
• For numerical attributes, a missing attribute value may be replaced by the attribute average value restricted to the concept.
• Assigning all possible values of the attribute. A case with a missing attribute value is replaced by a set of new cases, in which the missing attribute value is replaced, in turn, by each possible value of the attribute.
• Assigning all possible values of the attribute restricted to the concept.
• Ignoring cases with missing attribute values. The original data set, with missing attribute values, is replaced by a new data set from which the cases containing missing attribute values have been deleted.
• Considering missing attribute values as special values.
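Several of the preprocessing methods listed above can be illustrated with a minimal sketch. The decision table, the attribute names, the "?" marker for a missing value, and all function names below are assumptions made for illustration only; they are not taken from the chapter or from any particular library. The sketch covers the most common value (with and without restriction to the concept), assignment of all possible values, and deletion of incomplete cases; numerical averages are omitted.

```python
from collections import Counter
from itertools import product

MISSING = "?"  # assumed marker for a missing attribute value

# Hypothetical decision table: each case is (attribute values, concept).
cases = [
    ({"Temperature": "high", "Headache": "?"}, "yes"),
    ({"Temperature": "high", "Headache": "yes"}, "yes"),
    ({"Temperature": "?", "Headache": "no"}, "no"),
    ({"Temperature": "normal", "Headache": "no"}, "no"),
    ({"Temperature": "high", "Headache": "no"}, "no"),
    ({"Temperature": "normal", "Headache": "no"}, "no"),
]


def most_common_value(cases, attribute, concept=None):
    """Most frequent known value of `attribute`, optionally within one concept."""
    counts = Counter(
        values[attribute]
        for values, decision in cases
        if values[attribute] != MISSING and (concept is None or decision == concept)
    )
    return counts.most_common(1)[0][0]


def fill_most_common(cases, restrict_to_concept=False):
    """Replace every missing value by the most common value of its attribute."""
    filled = []
    for values, decision in cases:
        concept = decision if restrict_to_concept else None
        new_values = {
            a: most_common_value(cases, a, concept) if v == MISSING else v
            for a, v in values.items()
        }
        filled.append((new_values, decision))
    return filled


def expand_all_values(cases):
    """Replace each incomplete case by one new case per possible value."""
    attributes = list(cases[0][0])
    # The domain of an attribute is taken to be its set of known values.
    domains = {
        a: sorted({v[a] for v, _ in cases if v[a] != MISSING}) for a in attributes
    }
    expanded = []
    for values, decision in cases:
        options = [
            domains[a] if values[a] == MISSING else [values[a]] for a in attributes
        ]
        for combo in product(*options):
            expanded.append((dict(zip(attributes, combo)), decision))
    return expanded


def drop_incomplete(cases):
    """Keep only the cases that have no missing attribute values."""
    return [(v, d) for v, d in cases if MISSING not in v.values()]


globally_filled = fill_most_common(cases)
concept_filled = fill_most_common(cases, restrict_to_concept=True)
expanded = expand_all_values(cases)
complete_only = drop_incomplete(cases)
```

On this toy table the two replacement variants disagree: the missing Headache value of the first case becomes "no" when all cases are counted but "yes" when only cases of the same concept are counted, which shows why restricting to the concept can matter.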

The latter strategy is exemplified by the C4.5 approach to missing attribute values [21] or by a modified LEM2 algorithm [9, 13]. In both algorithms, original data sets with missing attribute values are not preprocessed, i.e., data sets are not preliminarily converted into complete data sets.

Note that, from the viewpoint of rough set theory, in the former strategy the conventional indiscernibility relation may be applied to describe the process of data mining since, after preprocessing, the data set is complete (has no missing attribute values). Furthermore, lower and upper approximations, other basic ideas of rough set theory, are also conventional.
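For a complete data set, the conventional notions mentioned here can be sketched in a few lines. Two cases are indiscernible when they agree on every attribute; the lower approximation of a concept is the union of the indiscernibility classes (elementary sets) contained in the concept, and the upper approximation is the union of the classes that intersect it. The table and the function names below are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical complete decision table: case id -> (attribute values, concept).
table = {
    1: (("high", "yes"), "flu"),
    2: (("high", "yes"), "flu"),
    3: (("high", "no"), "flu"),
    4: (("high", "no"), "no flu"),
    5: (("normal", "no"), "no flu"),
}


def elementary_sets(table):
    """Group cases that have identical values on all attributes."""
    blocks = defaultdict(set)
    for case, (values, _) in table.items():
        blocks[values].add(case)
    return list(blocks.values())


def approximations(table, concept):
    """Return the (lower, upper) approximations of the given concept."""
    target = {case for case, (_, decision) in table.items() if decision == concept}
    lower, upper = set(), set()
    for block in elementary_sets(table):
        if block <= target:  # block lies entirely inside the concept
            lower |= block
        if block & target:   # block shares at least one case with the concept
            upper |= block
    return lower, upper


lower, upper = approximations(table, "flu")
```

Here cases 3 and 4 are indiscernible but belong to different concepts, so they fall in the upper approximation of "flu" without belonging to its lower approximation.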

In this chapter we will concentrate on the latter strategy used for rule induction, i.e., we will assume that the rule sets are induced from the original data sets, with missing attribute values, not preprocessed as in the former strategy.

We will assume that there are three reasons for decision tables to be incomplete. The first reason is that an attribute value, for a specific case, is lost. For example, originally the attribute value was known; however, due to a variety of reasons, currently the value is not available. Maybe it was recorded but later it was erased. The second possibility is that an attribute value was not relevant: the case was decided to be a member of some concept, i.e., was classified, or diagnosed, in spite of the fact that some attribute values were not
