Evolving Explanatory Novel Patterns for Semantically-Based Text Mining - Natural Language Processing and Text Mining


tion and guided operations to ensure that the produced offspring are semantically

coherent.

To address the issues of representation and new genetic operations, and so produce an effective KDT process, our working model is divided into two phases. The first phase is a preprocessing step that produces both the training information used for later evaluation and the initial population of the GA. The second phase is the knowledge discovery itself: it aims to produce and evaluate unseen explanatory hypotheses.

Processing starts with the IE task (Figure 9.1), which applies extraction patterns and generates a rule-like representation for each document of the domain-specific corpus. After processing a set of n documents, the extraction stage will have produced n rules, each representing a document's content in terms of its conditions and conclusions. Once generated, these rules, along with other training data, become the “model” that guides the GA-based discovery (see Figure 9.1).
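The rule-like representation described above can be pictured with a small data structure. The following is only a minimal sketch under stated assumptions: the names `Unit`, `Rule`, `build_rules` and `CONCLUSION_ROLES`, the field layout, and the role labels are all hypothetical and do not come from the chapter, and the extraction patterns themselves are not modelled (the input is assumed to be already-extracted units).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Unit:
    """One extracted piece of a document (hypothetical layout)."""
    role: str        # rhetorical role label, e.g. "goal" or "conclusion" (assumed labels)
    predicate: str   # predicate recognised in the text
    argument: str    # argument text attached to the predicate


@dataclass(frozen=True)
class Rule:
    """Rule-like representation of one document: conditions -> conclusions."""
    doc_id: str
    conditions: Tuple[Unit, ...]
    conclusions: Tuple[Unit, ...]


# Assumption: units carrying one of these roles form the conclusion side of a rule.
CONCLUSION_ROLES = {"conclusion"}


def build_rules(extracted: Dict[str, List[Unit]]) -> List[Rule]:
    """Turn already-extracted units into one rule per document."""
    rules = []
    for doc_id, units in extracted.items():
        conditions = tuple(u for u in units if u.role not in CONCLUSION_ROLES)
        conclusions = tuple(u for u in units if u.role in CONCLUSION_ROLES)
        rules.append(Rule(doc_id, conditions, conclusions))
    return rules
```

With this layout, n documents always yield n rules, matching the one-rule-per-document property stated in the text.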

[Figure 9.1 (diagram). Preprocessing and Training: Domain Corpus (Doc 1 ... Doc n); IE task (Part-of-Speech Tagger; Role and Predicate Recognition; Learning); Rule 1 ... Rule n; Document Representation; LSA training; Preprocessed Data; Initial Population. Knowledge Discovery: GA; Hypothesis 1 ... Hypothesis p (p << n); Discovered Novel Hypotheses: Hypothesis 1 ... Hypothesis k (k << p).]

Fig. 9.1. The Evolutionary Model for Knowledge Discovery from Texts

To generate an initial set of hypotheses, an initial population is created by building random hypotheses from the initial rules; that is, hypotheses are constructed that contain predicate and rhetorical information taken from the rules. The GA then runs for a fixed number of generations. At the end, a small set of the best hypotheses is obtained.
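The overall loop just described (random initial population drawn from the rules, a fixed number of generations, a small set of best hypotheses at the end) can be sketched as follows. This is an illustrative skeleton, not the chapter's algorithm: the function names, the tuple-based hypothesis layout, the naive conclusion-swapping crossover, and the single scalar `fitness` callable are all assumptions. The model itself relies on semantically constrained operators and a multi-objective evaluation, neither of which is reproduced here.

```python
import random


def random_hypothesis(rules, rng, n_conditions=2):
    """Build one random hypothesis by sampling units from the rules.

    rules: list of (conditions, conclusions) pairs, each a list of strings.
    Returns (tuple_of_conditions, conclusion).
    """
    cond_pool = [c for conds, _ in rules for c in conds]
    concl_pool = [c for _, concls in rules for c in concls]
    k = min(n_conditions, len(cond_pool))
    return (tuple(rng.sample(cond_pool, k)), rng.choice(concl_pool))


def crossover(parent_a, parent_b):
    """Placeholder operator: keep A's conditions, take B's conclusion."""
    return (parent_a[0], parent_b[1])


def evolve(rules, fitness, pop_size=20, generations=10, top_k=3, seed=0):
    """Run a fixed number of generations and return the top_k hypotheses."""
    rng = random.Random(seed)
    population = [random_hypothesis(rules, rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half as parents (simple truncation selection).
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [
            crossover(rng.choice(parents), rng.choice(parents))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return sorted(population, key=fitness, reverse=True)[:top_k]
```

A call such as `evolve(rules, fitness=my_score, top_k=3)` returns three hypotheses ranked by the supplied score, where `my_score` stands in for whatever evaluation the discovery phase actually uses.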

The description of the model is organised as follows. Section 9.3.1 presents the main features of the text preprocessing phase and how the representation for the hypotheses is generated. It also describes the training tasks that generate the initial knowledge (semantic and rhetorical information) used to feed the discovery. Section 9.3.2 describes the constrained genetic operations that enable hypothesis discovery, and proposes different evaluation metrics to assess the plausibility of the discovered hypotheses in a multi-objective context.
