Interactive Comprehensible Data Mining - Ambient Intelligence for Scientific Discovery

Information Technology Reference

In-Depth Information

comprehensible results. Rules are inherently much more understandable than

decision trees or probabilistic or statistical descriptions.

Thirdly, the genetic algorithm aims to discover rules and rule sets which

optimize an objective function (“fitness”), and manipulation of this allows us

to explore different areas of the search space. For example, we can strongly

penalize rules that give false positive in order to obtain rules that can be used to

determine the class of new data examples. Alternatively, we can bias the system

towards rules that indicate the typical characteristics of items in a group, whether

these characteristics are shared with another group or not. Short rules with few

terms are going to be easier to comprehend than longer ones, but longer rules

reveal more information. We can allow the user to choose which they would

prefer by controlling the fitness function. Initially we might prefer short rules,

in order to get an overview. Note that since the Haiku system is interactive

and iterative, when we have a higher level of comprehension, we can repeat the

process but allow the rules to become longer and hence more detailed, moving

our understanding from a broad overview to comprehending the details.

We use a special type of GA that evolves rules; these produce terms to

describe the underlying data of the form:

IF

term

OP value

|

range (& ...) THEN

term

OP value

|

range (& ...)

where

term

is a class from the data set, OP is one of the standard comparison

, £ , ∧ ), value is a numeric or symbolic value, and range is a

numeric range. A typical rule would therefore be:

operators (

<, >,

=

colour

texture

size <

fruit

IF

=strawberry

There are three situations that are of particular interest to us:

=red &

=soft &

3.2 THEN

-

Classification:

when the left hand side of the equation tries to predict a single

class (usually known) on the right hand side

-

Characterization:

when the system tries to find rules that describe portions

of the data set

-

Association:

which detects correlations in attribute values within a portion

of the data set

The algorithm follows fairly typical genetic algorithmic approaches in its

implementation, but with specialized mutation and crossover operators, in order

to explore the space effectively. We start with a number of random rules created

using values from the data. The rules population is then evolved based on how

well they perform. The fittest rules are taken as the basis for the next population,

with crossover creating new rules from clauses of previously successful rules.

Mutation is specialized: for ranges of values it can expand or contract that range;

for numbers it can increase or decrease them; for operators it can substitute them

with others.

Statistically principled comparisons showed that this technique is at least as

good as conventional machine learning at classification [1], but has advantages

over the more conventional approaches in that it can discover characteristics and

associations too.

Ambient Intelligence for Scientific Discovery

Search WWH ::

Custom Search

Home