solution is better than another in global terms, that is, a child is better if it
becomes a non-dominated hypothesis.
Next, since our model is based on a multi-criteria approach, we have to face
three important issues in order to assess every hypothesis' fitness: Pareto dominance,
fitness assignment and the diversity problem [5]. Despite the considerable number of
state-of-the-art methods for handling these issues [5], only a few of them
address the problem in an integrated and representation-independent way. In
particular, Zitzler [35] proposes an interesting method, the Strength Pareto Evolutionary
Algorithm (SPEA), which uses a mixture of established methods and new techniques
to find multiple Pareto-optimal solutions in parallel while, at the same time,
keeping the population as diverse as possible. We have also adapted the original
SPEA algorithm to allow for incremental updating of the Pareto-optimal set
along with our steady-state replacement method.
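To make the dominance and fitness-assignment steps concrete, the following is a minimal sketch in Python (not the Prolog prototype described below), assuming each hypothesis is reduced to a tuple of objective scores to be maximised. The strength and raw-fitness formulas follow the original SPEA description; the function names and data layout are illustrative only.

# Minimal sketch of Pareto dominance and SPEA-style strength fitness.
# Hypothetical layout: each hypothesis is a tuple of objective scores,
# all assumed to be maximised.

def dominates(a, b):
    """True if a Pareto-dominates b: no worse on every objective,
    strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def spea_strength(archive, population):
    """Assign SPEA-like fitness values.

    Each archive (non-dominated) member gets a strength equal to the
    fraction of population members it dominates; each population member
    gets a raw fitness of one plus the strengths of the archive members
    that dominate it (lower is better for population members)."""
    strengths = {}
    for a in archive:
        covered = sum(1 for p in population if dominates(a, p))
        strengths[a] = covered / (len(population) + 1)
    fitness = {}
    for p in population:
        fitness[p] = 1.0 + sum(s for a, s in strengths.items() if dominates(a, p))
    return strengths, fitness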
9.4 Analysis and Results
In order to assess the quality of the knowledge (hypotheses) discovered by the model,
a Prolog-based prototype has been built. The IE task has been implemented as a set
of modules whose main outcome is the set of rules extracted from the documents.
In addition, an intermediate training module is responsible for generating informa-
tion from the LSA analysis and from the rules just produced. The initial rules are
represented by facts containing lists of relations for both the antecedent and the consequent.
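Purely as an illustration (the actual predicate names and relation vocabulary are not given in the text), such a rule fact might be mirrored by a structure like the following, with separate relation lists for the antecedent and the consequent; every name below is invented for the example.

# Illustrative mirror of an initial rule "fact": an identifier plus two
# lists of relations, one for the antecedent and one for the consequent.
# Relation names and arguments are made up for the example.
rule = {
    "id": "hyp_001",
    "antecedent": [("causes", "nitrogen_deficiency", "leaf_yellowing"),
                   ("located_in", "crop", "soil")],
    "consequent": [("affects", "fertilisation", "yield")],
}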
For the purpose of the experiments, the corpus of documents was obtained
from the AGRIS database for agricultural and food science. We selected this kind of
corpus because it has been properly cleaned up and belongs to a scientific area about
which we have no prior knowledge, so as to avoid any possible bias and to make the
results more realistic. A set of 1000 documents was extracted, of which one third
was used for setting parameters and making general adjustments, and the rest was
used for the GA itself in the evaluation stage.
Next, we tried to answer two basic questions concerning our original
aims:
a) How well does the GA for KDT behave?
b) How good are the hypotheses produced, according to human experts, in terms of
text mining's ultimate goals: interestingness, novelty, usefulness, etc.?
In order to address these issues, we used a methodology consisting of two phases:
the system evaluation and the experts' assessment.
a) System Evaluation: this aims at investigating the behavior of, and the results pro-
duced by, the GA.
We set up the GA by generating an initial population of 100 semi-random hy-
potheses. In addition, we defined the main global parameters, such as the Mutation
Probability (0.2), Crossover Probability (0.8), Maximum Size of the Pareto set (5%),
etc. We ran five versions of the GA with the same configuration of parameters
but different pairs of terms to address the quest for explanatory novel hypothe-
ses.
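A minimal sketch of how one such run might be configured and driven is given below, assuming a simple loop around the steady-state replacement step. Only the parameter values come from the text; the helper callables, the stopping criterion and all other names are hypothetical and not part of the reported prototype.

# Hypothetical configuration and driver for one GA run, mirroring the
# parameters reported in the text. The evolve_step() callable stands in
# for selection, crossover, mutation and Pareto-archive updating.
CONFIG = {
    "population_size": 100,       # semi-random initial hypotheses
    "mutation_prob": 0.2,
    "crossover_prob": 0.8,
    "max_pareto_fraction": 0.05,  # maximum size of the Pareto set (5%)
    "generations": 500,           # illustrative stopping criterion
}

def run_ga(seed_terms, init_population, evolve_step, config=CONFIG):
    """Run one GA configuration for a given pair of target terms.

    seed_terms      -- the pair of terms guiding the search for hypotheses
    init_population -- callable producing the semi-random initial population
    evolve_step     -- callable performing one steady-state replacement step
    """
    population = init_population(seed_terms, config["population_size"])
    archive = []  # incrementally updated Pareto-optimal set
    for _ in range(config["generations"]):
        population, archive = evolve_step(population, archive, config)
    return archive

# Five runs would then use the same CONFIG but different term pairs, e.g.:
# for pair in term_pairs:
#     pareto_set = run_ga(pair, make_population, evolve_step)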
The results obtained from running the GA in our experiment are shown as a
representative behavior in figure 9.5, where the