enables the discovery but instead a set of primitive search strategies which do not
necessarily explore the whole search space due to their dependence on the kind of
semantic information previously extracted.
Although DM tasks have commonly been tackled as learning problems, the nature of DM (i.e., finding unseen, novel and interesting patterns) suggests that it should instead be seen as a problem of search (i.e., different hypotheses are explored) and optimization (i.e., hypotheses which maximize quality criteria should be preferred).
Although a significant number of successful practical search and optimization techniques exist [24, 5], certain features make some techniques more appealing than others for this kind of task, in terms of the representation required, the training sets required, supervision, hypothesis assessment, robustness of the search, etc.
In particular, the kind of evolutionary computation technique known as Genetic Algorithms (GAs) has proved promising for search and optimization purposes. Compared with classical search and optimization algorithms, GAs are much less susceptible to getting stuck in suboptimal local regions of the search space, as they perform a global search by exploring solutions in parallel. GAs are robust and able to cope with noisy and missing data, and they can search spaces of hypotheses containing complex interacting parts, where the impact of each part on overall hypothesis fitness may be difficult to model [13].
In order to use GAs to find optimal values of decision variables, we first need to
represent the hypotheses in binary strings (the typical pseudo-chromosomal repre-
sentation of a hypothesis in traditional GAs). After creating an initial population of
strings at random, genetic operations are applied with some probability in order to
improve the population. Once a new string is created by the operators, the solution is evaluated in terms of its fitness, a measure of individual goodness. Individuals for the next generation are selected according to their fitness values, which determine those to be chosen for reproduction. If a termination condition is not satisfied, the population is modified by the operators and a new (and hopefully better) population is created. Each iteration of this process is called a generation, and the entire set of generations is called a run. At the end of a run, the population often contains one or more highly fit chromosomes.
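The loop described above can be sketched as follows. This is a minimal illustration, not the chapter's own implementation: the bit-counting fitness function (the classic OneMax toy problem), the tournament selection scheme, and all parameter values are assumptions standing in for a real DM quality measure.

```python
import random

def onemax(bits):
    """Toy fitness: number of 1-bits (a stand-in for a real DM quality criterion)."""
    return sum(bits)

def tournament(pop, fits, k=2):
    """Select one parent: the fittest of k randomly sampled individuals."""
    picks = random.sample(range(len(pop)), k)
    return pop[max(picks, key=lambda i: fits[i])]

def crossover(a, b):
    """Single-point crossover of two binary strings."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, rate=0.01):
    """Flip each bit with a small probability."""
    return [1 - b if random.random() < rate else b for b in bits]

def run_ga(length=20, pop_size=30, generations=50):
    # Initial population of random binary strings
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):  # each iteration is one generation
        fits = [onemax(ind) for ind in pop]
        # Build the next generation: select, recombine, mutate
        pop = [mutate(crossover(tournament(pop, fits), tournament(pop, fits)))
               for _ in range(pop_size)]
    return max(pop, key=onemax)

random.seed(0)
best = run_ga()
```

After a run, `best` is typically close to the all-ones string, i.e., near the global optimum of this toy fitness landscape.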
One of the major contributions of evolutionary algorithms (e.g., GAs) for an im-
portant number of DM tasks (e.g., rule discovery, etc.) is that they tend to cope well
with attribute interactions. This is in contrast to the local, greedy search performed
by often-used rule induction and decision-tree algorithms [3, 14]. Most rule induc-
tion algorithms generate (prune) a rule by selecting (removing) one rule condition
at a time, whereas evolutionary algorithms usually evaluate a rule as a whole via the
fitness function rather than evaluating the impact of adding/removing one condition
to/from a rule. In addition, operations such as crossover usually swap several rule
conditions at a time between two individuals.
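The contrast drawn here, evaluating a rule as a whole and swapping several conditions at once, can be made concrete with a small sketch. The dataset, attribute names, and the uniform-crossover choice below are all hypothetical illustrations, not taken from the chapter.

```python
import random

# Hypothetical toy dataset: each record is (attribute dict, class label).
DATA = [
    ({"outlook": "sunny", "windy": False}, "play"),
    ({"outlook": "sunny", "windy": True},  "stay"),
    ({"outlook": "rain",  "windy": False}, "play"),
    ({"outlook": "rain",  "windy": True},  "stay"),
]
ATTRS = ["outlook", "windy"]

def rule_fitness(rule, target="stay"):
    """Evaluate the IF-THEN rule as a whole: of the records its antecedent
    covers, what fraction belong to the target class?
    `rule` maps attribute -> required value; an absent attribute means the
    condition is switched off."""
    covered = [label for record, label in DATA
               if all(record[a] == v for a, v in rule.items())]
    if not covered:
        return 0.0
    return covered.count(target) / len(covered)

def crossover(r1, r2):
    """Uniform crossover: the child takes each condition slot from either
    parent, so several rule conditions can be exchanged in one operation."""
    child = {}
    for a in ATTRS:
        parent = r1 if random.random() < 0.5 else r2
        if a in parent:
            child[a] = parent[a]
    return child
```

Note that `rule_fitness` never scores an individual condition in isolation; the whole antecedent is matched against each record, which is what lets a GA credit combinations of interacting conditions that a one-condition-at-a-time search would miss.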
Typical tasks for GAs in DM have included [12, 34]: Classification, in which the goal is to predict the value (the class) of a user-defined goal attribute based on the values of other attributes; Discovery of Association rules, where binary attributes (items) contained in data instances (i.e., records) are used to discover associations of the form IF-THEN; and Rule discovery/prediction, in which the system can produce many different combinations of attributes (even if the original attributes do not