a) Creating the initial population of hypotheses: once the initial rules have been produced, their components (rhetorical roles, predicate relations, etc.) are isolated and stored in a separate “database.” This information is used both to build the initial hypotheses and to feed the subsequent genetic operations (e.g., mutating a role requires randomly picking a replacement role from this database); the first sketch following this list illustrates this step.
b) Computing training information, from which two kinds of training data are obtained:
a) Computing correlations between rhetorical roles and predicate relations: the connection between rhetorical information and the predicate action is key to producing coherent hypotheses. For example, in some domain, is the goal of a hypothesis likely to be associated with the construction of some component? In a health context, this connection would be less likely than having “finding a new medicine for ..” as a goal.
In order to address this issue, we adopted a Bayesian approach in which we obtain the conditional probability of some predicate p given some attached rhetorical role r, namely Prob(p | r). These probability values are later used to automatically evaluate some of the hypotheses' criteria.
b) Computing co-occurrences of rhetorical information: one can think of a hypothesis as an abstract whose text paragraphs are semantically related to each other. Consequently, the meaning of the scientific evidence stated in the abstract may subtly change if the order of the facts is altered.
This suggests that, in generating valid hypotheses, some rule structures will be more or less desirable than others. For instance, if every rule contains a “goal” as its first rhetorical role and the GA has generated a hypothesis starting with some “conclusion” or “method,” that hypothesis will be penalised and is therefore very unlikely to survive into the next generation. Since order affects a rule's meaning, we can regard the p roles of a rule as a sequence of tags <r_1, r_2, ..., r_p> in which r_i precedes r_{i+1}, and from the rules we generate the conditional probabilities Prob(r_p | r_q) for every pair of roles r_p, r_q. The probability that r_q precedes r_p is then used in evaluating new hypotheses, for instance in terms of their coherence. Both of these computations are sketched after this list.
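The first sketch below illustrates, in Python, how the components isolated from the initial rules might be pooled into the separate “database” and then sampled to assemble semi-random hypotheses for the initial population. The rule representation (a list of role–predicate pairs), the function names, and the population parameters are illustrative assumptions, not the authors' actual implementation.

```python
import random

# Assumed rule representation: a rule is a list of
# (rhetorical_role, predicate_relation) pairs, e.g.
# [("goal", "find(new_medicine, disease)"), ("method", "apply(screening, compounds)")]

def build_component_database(rules):
    """Isolate the rules' components (roles, predicates) into a separate pool
    that feeds both initial-hypothesis construction and later genetic
    operations such as role mutation."""
    roles, predicates = set(), set()
    for rule in rules:
        for role, predicate in rule:
            roles.add(role)
            predicates.add(predicate)
    return {"roles": sorted(roles), "predicates": sorted(predicates)}

def random_hypothesis(database, length):
    """Assemble one semi-random hypothesis by sampling components from the pool."""
    return [(random.choice(database["roles"]),
             random.choice(database["predicates"]))
            for _ in range(length)]

def initial_population(rules, size=100, length=5):
    """Create the GA's initial population of semi-random hypotheses."""
    database = build_component_database(rules)
    return [random_hypothesis(database, length) for _ in range(size)]
```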
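The second sketch estimates the two kinds of training data from the initial rules: the conditional probabilities Prob(p | r) of a predicate given its attached rhetorical role, and role-precedence probabilities obtained from adjacent roles within each rule. It is a minimal maximum-likelihood sketch assuming the same rule representation as above; smoothing and the exact estimators used by the authors are not shown.

```python
from collections import Counter

def role_predicate_probabilities(rules):
    """Estimate Prob(p | r): relative frequency of predicate p among all
    predicates attached to rhetorical role r in the training rules."""
    pair_counts, role_counts = Counter(), Counter()
    for rule in rules:
        for role, predicate in rule:
            pair_counts[(role, predicate)] += 1
            role_counts[role] += 1
    return {(r, p): c / role_counts[r] for (r, p), c in pair_counts.items()}

def role_precedence_probabilities(rules):
    """Estimate Prob(r_next | r_prev) from adjacent roles in each rule,
    i.e., how likely one role is to immediately follow another."""
    pair_counts, prev_counts = Counter(), Counter()
    for rule in rules:
        roles = [role for role, _ in rule]
        for prev, nxt in zip(roles, roles[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {(a, b): c / prev_counts[a] for (a, b), c in pair_counts.items()}
```

Under this kind of estimate, a hypothesis whose role sequence contains transitions with very low precedence probability (e.g., starting with “conclusion” when the training rules always start with “goal”) would receive a poor coherence score.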
9.3.2 Knowledge Discovery and Automatic Evaluation of Patterns
Our approach to KDT is strongly guided by semantic and rhetorical information; consequently, some soft constraints must be met before producing offspring so as to keep them coherent.
The GA starts from an initial population, which in this case is a set of semi-random hypotheses built up from the preprocessing phase. Next, constrained GA operations are applied and the hypotheses are evaluated. In order to assign a fitness to every individual, we use an evolutionary multi-objective optimisation strategy based on the Strength Pareto Evolutionary Algorithm (SPEA) [35]. SPEA deals with the diversity of the solutions (i.e., niche formation) and the fitness assignment as a whole, in a representation-independent way.
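As a rough illustration of SPEA-style fitness assignment, the sketch below assumes that every hypothesis has already been scored on several objectives (e.g., coherence, relevance, coverage, all to be maximised). Non-dominated archive members receive a strength value, and population members a derived fitness where lower is better. This is a simplified sketch of the standard SPEA scheme [35], not the authors' implementation; the clustering-based archive truncation step is omitted.

```python
def dominates(a, b):
    """Pareto dominance for maximisation: a dominates b if it is no worse on
    every objective and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def spea_fitness(population_scores, archive_scores):
    """Simplified SPEA fitness assignment (lower fitness is better).

    population_scores / archive_scores: lists of objective-score tuples for
    the current population and the external non-dominated archive."""
    n = len(population_scores)
    # Strength of each archive member: fraction of the population it dominates.
    strengths = [sum(1 for p in population_scores if dominates(a, p)) / (n + 1)
                 for a in archive_scores]
    # Fitness of each population member: 1 plus the strengths of the archive
    # members that dominate it, so heavily dominated hypotheses score worse.
    pop_fitness = [1.0 + sum(s for a, s in zip(archive_scores, strengths)
                             if dominates(a, p))
                   for p in population_scores]
    return strengths, pop_fitness
```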