enables the discovery but instead a set of primitive search strategies which do not
necessarily explore the whole search space due to their dependence on the kind of
semantic information previously extracted.
Although DM tasks have commonly been tackled as learning problems, the nature of DM (i.e., finding unseen, novel and interesting patterns) suggests that it should instead be seen as a problem of search (i.e., different hypotheses are explored) and optimization (i.e., hypotheses which maximize quality criteria should be preferred).
Although a significant number of successful practical search and optimization techniques exist [24, 5], certain features make some techniques more appealing than others for this kind of task, in terms of the representation required, the training sets required, supervision, hypothesis assessment, robustness of the search, etc.
In particular, the kind of evolutionary computation technique known as Genetic Algorithms (GAs) has proved promising for search and optimization purposes. Compared with classical search and optimization algorithms, GAs are much less susceptible to getting stuck in suboptimal local regions of the search space, as they perform a global search by exploring solutions in parallel. GAs are robust and able to cope with noisy and missing data, and they can search spaces of hypotheses containing complex interacting parts, where the impact of each part on overall hypothesis fitness may be difficult to model [13].
In order to use GAs to find optimal values of decision variables, we first need to
represent the hypotheses in binary strings (the typical pseudo-chromosomal repre-
sentation of a hypothesis in traditional GAs). After creating an initial population of
strings at random, genetic operations are applied with some probability in order to
improve the population. Once a new string is created by the operators, the solution is evaluated in terms of its fitness, a measure of individual goodness. Individuals for the next generation are selected according to their fitness values, which determine those to be chosen for reproduction. If a termination condition is not satisfied, the population is modified by the operators and a new (and hopefully better) population is created. Each iteration of this process is called a generation, and the entire set of generations is called a run. At the end of a run, the population often contains one or more highly fit chromosomes.
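The loop described above can be sketched as follows. This is a minimal illustration, not the chapter's own implementation: the bit-counting fitness function (the classic OneMax toy problem), the tournament selection scheme, and all parameter values are assumptions standing in for a real DM quality measure.

```python
import random

def onemax(bits):
    """Toy fitness: number of 1-bits (a stand-in for a real DM quality criterion)."""
    return sum(bits)

def tournament(pop, fits, k=2):
    """Select one parent: the fittest of k randomly sampled individuals."""
    picks = random.sample(range(len(pop)), k)
    return pop[max(picks, key=lambda i: fits[i])]

def crossover(a, b):
    """Single-point crossover of two binary strings."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, rate=0.01):
    """Flip each bit with a small probability."""
    return [1 - b if random.random() < rate else b for b in bits]

def run_ga(length=20, pop_size=30, generations=50):
    # Initial population of random binary strings
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):  # each iteration is one generation
        fits = [onemax(ind) for ind in pop]
        # Build the next generation: select, recombine, mutate
        pop = [mutate(crossover(tournament(pop, fits), tournament(pop, fits)))
               for _ in range(pop_size)]
    return max(pop, key=onemax)

random.seed(0)
best = run_ga()
```

After a run, `best` is typically close to the all-ones string, i.e., near the global optimum of this toy fitness landscape.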
One of the major contributions of evolutionary algorithms (e.g., GAs) for an im-
portant number of DM tasks (e.g., rule discovery, etc.) is that they tend to cope well
with attribute interactions. This is in contrast to the local, greedy search performed
by often-used rule induction and decision-tree algorithms [3, 14]. Most rule induc-
tion algorithms generate (prune) a rule by selecting (removing) one rule condition
at a time, whereas evolutionary algorithms usually evaluate a rule as a whole via the
fitness function rather than evaluating the impact of adding/removing one condition
to/from a rule. In addition, operations such as crossover usually swap several rule
conditions at a time between two individuals.
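The contrast drawn here, evaluating a rule as a whole and swapping several conditions at once, can be made concrete with a small sketch. The dataset, attribute names, and the uniform-crossover choice below are all hypothetical illustrations, not taken from the chapter.

```python
import random

# Hypothetical toy dataset: each record is (attribute dict, class label).
DATA = [
    ({"outlook": "sunny", "windy": False}, "play"),
    ({"outlook": "sunny", "windy": True},  "stay"),
    ({"outlook": "rain",  "windy": False}, "play"),
    ({"outlook": "rain",  "windy": True},  "stay"),
]
ATTRS = ["outlook", "windy"]

def rule_fitness(rule, target="stay"):
    """Evaluate the IF-THEN rule as a whole: of the records its antecedent
    covers, what fraction belong to the target class?
    `rule` maps attribute -> required value; an absent attribute means the
    condition is switched off."""
    covered = [label for record, label in DATA
               if all(record[a] == v for a, v in rule.items())]
    if not covered:
        return 0.0
    return covered.count(target) / len(covered)

def crossover(r1, r2):
    """Uniform crossover: the child takes each condition slot from either
    parent, so several rule conditions can be exchanged in one operation."""
    child = {}
    for a in ATTRS:
        parent = r1 if random.random() < 0.5 else r2
        if a in parent:
            child[a] = parent[a]
    return child
```

Note that `rule_fitness` never scores an individual condition in isolation; the whole antecedent is matched against each record, which is what lets a GA credit combinations of interacting conditions that a one-condition-at-a-time search would miss.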
Typical tasks for GAs in DM have included [12, 34]: Classification, in which the goal is to predict the value (the class) of a user-defined goal attribute based on the values of other attributes; Discovery of Association rules, where binary attributes (items) contained in data instances (i.e., records) are used to discover associations of the form IF-THEN; and Rule discovery/prediction, in which the system can produce many different combinations of attributes (even if the original attributes do not