Evolving Explanatory Novel Patterns for Semantically-Based Text Mining - Natural Language Processing and Text Mining

9.3.1 Text Preprocessing and Training

The preprocessing phase has two main goals: to extract important information from
the texts and to use that information to generate both training data and the initial
population for the GA.

In terms of text preprocessing (see the first phase in Figure 9.1), an underlying
principle of our approach is to make good use of the structure of the documents
in the discovery process. It is well known that processing full documents has
inherent complexities [23], so we have restricted our scope to a single genre:
scientific/technical abstracts. These have a well-defined macro-structure (a
genre-dependent rhetorical structure) that “summarises” what the author states in
the full document (i.e., background information, methods, achievements,
conclusions, etc.).

Unlike patterns extracted for usual IE purposes, such as those in [18, 19, 20], this
macro-structure and its roles are domain-independent but genre-based, so it is
relatively easy to translate it into different contexts.

As an example, suppose that we are given the following abstract where bold
sequences of words indicate the markers triggering the IE patterns:

The current study aims to provide GOAL the basic information about the
fertilisers system, specially in its nutrient dynamics. OBJECT Long-term
trends of the soil's chemical and physical fertility were also analysed.
The methodology is based on the METHOD study of lands' plots using
different histories of usage of crop rotation with fertilisers in order to
detect long-term changes. ... Finally, a deep checking of data allowed us
to conclude that CONCLUSION soils have improved after 12 years of
continuous rotation.
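To make the idea of marker-triggered IE patterns concrete, the following is a minimal sketch rather than the patterns actually used in this approach. It assumes one regular expression per rhetorical role, built from the marker phrases visible in the sample abstract. The names `ROLE_PATTERNS` and `extract_roles` are hypothetical, and the role labels (GOAL, OBJECT, and so on) are left out of the input text because they are annotations, not part of the abstract.

```python
import re

# Hypothetical marker patterns, one per rhetorical role. The marker phrases
# come from the sample abstract; the regex encoding is an assumption.
ROLE_PATTERNS = {
    "goal": re.compile(r"The current study aims to provide (?P<content>[^.]+)\."),
    "object": re.compile(r"(?P<content>[^.]+?) were also analysed\."),
    "method": re.compile(r"The methodology is based on the (?P<content>[^.]+)\."),
    "conclusion": re.compile(r"allowed us to conclude that (?P<content>[^.]+)\."),
}

# The sample abstract without its role annotations; "..." is the elision
# present in the original example.
ABSTRACT = (
    "The current study aims to provide the basic information about the "
    "fertilisers system, specially in its nutrient dynamics. "
    "Long-term trends of the soil's chemical and physical fertility were "
    "also analysed. "
    "The methodology is based on the study of lands' plots using different "
    "histories of usage of crop rotation with fertilisers in order to detect "
    "long-term changes. ... "
    "Finally, a deep checking of data allowed us to conclude that soils have "
    "improved after 12 years of continuous rotation."
)


def extract_roles(text):
    """Return a dict mapping each matched role to the text it introduces."""
    roles = {}
    for role, pattern in ROLE_PATTERNS.items():
        match = pattern.search(text)
        if match:
            roles[role] = match.group("content").strip()
    return roles


if __name__ == "__main__":
    for role, content in extract_roles(ABSTRACT).items():
        print(f"{role.upper()}: {content}")
```

Because the patterns key on genre-level markers ("aims to provide", "allowed us to conclude that") rather than on domain vocabulary, the same sketch would apply unchanged to an abstract from another field that uses the same phrasing.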

From such a structure, important constituents can be identified:

• Rhetorical Roles (discourse-level knowledge): these indicate important places
  where the author makes some “assertions” about his/her work (i.e., the author
  is stating the goals, the methods used, the conclusions reached, etc.). In the
  example above, the roles are represented by goal, object of study, method and
  conclusion.

• Predicate Relations: these are actions (a predicate and its arguments) which
  are directly connected to the role being identified. Each states a relation
  which holds between a set of terms (words which are part of a sentence), a
  predicate, and the role to which they are linked. For the example, they are
  as follows: provide('the basic information ..'), analyse('long-term trends
  ...'), study('lands plot using ...'), improve('soil ..improved after ..').

• Causal Relation(s): although there are no explicit causal relations in the
  above example, we can hypothesise a simple rule of the form:

    IF the current goals are G1,G2, .. and the means/methods used
    M1,M2, .. (and any other constraint/feature) THEN it is true that
    we can achieve the conclusions C1,C2, ..
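The three constituents above can be sketched as a small data structure. This is an illustration under stated assumptions, not the representation used by the authors: the class names `Predicate` and `Rule`, the function `build_rule`, and the `IF ... AND ... THEN ...` rendering are all hypothetical. The predicate arguments are copied from the example with their elisions left as they are.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    role: str      # rhetorical role the predicate is linked to
    name: str      # the action, e.g. "provide"
    argument: str  # argument text, elided exactly as in the example

    def render(self) -> str:
        return f"{self.role}({self.name}('{self.argument}'))"


@dataclass(frozen=True)
class Rule:
    conditions: tuple   # goals, objects, methods (the IF part)
    conclusions: tuple  # conclusions (the THEN part)

    def render(self) -> str:
        antecedent = " AND ".join(p.render() for p in self.conditions)
        consequent = " AND ".join(p.render() for p in self.conclusions)
        return f"IF {antecedent} THEN {consequent}"


def build_rule(predicates):
    """Assumed split: conclusion predicates form the consequent of the
    hypothesised causal rule; every other role goes into the antecedent."""
    conditions = tuple(p for p in predicates if p.role != "conclusion")
    conclusions = tuple(p for p in predicates if p.role == "conclusion")
    return Rule(conditions, conclusions)


# Predicate relations listed for the sample abstract.
SAMPLE = [
    Predicate("goal", "provide", "the basic information .."),
    Predicate("object", "analyse", "long-term trends ..."),
    Predicate("method", "study", "lands plot using ..."),
    Predicate("conclusion", "improve", "soil ..improved after .."),
]

if __name__ == "__main__":
    print(build_rule(SAMPLE).render())
```

Keeping the role alongside each predicate means the antecedent/consequent split follows from the rhetorical structure alone, with no domain knowledge about fertilisers or soils.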

Finally, the sample abstract may be represented in a rule-like form as follows:
