9.3.1 Text Preprocessing and Training
The preprocessing phase has two main goals: to extract important information from
the texts and to use that information to generate both training data and the initial
population for the GA.
In terms of text preprocessing (see the first phase in Figure 9.1), an underlying
principle of our approach is to make good use of the structure of the documents
in the discovery process. It is well known that processing full documents
has inherent complexities [23], so we have restricted our scope to a single
scientific genre: scientific/technical abstracts. These have a well-defined
macro-structure (a genre-dependent rhetorical structure) that “summarises” what the
author states in the full document (i.e., background information, methods,
achievements, conclusions, etc.).
Unlike the patterns extracted for usual IE purposes, such as those in [18, 19, 20],
this macro-structure and its roles are domain-independent but genre-based, so they
are relatively easy to transfer to different contexts.
As an example, suppose that we are given the following abstract, where each
bracketed span is labelled with the role it fills and the word sequences outside
the brackets (set in bold in the original) are the markers triggering the IE patterns:
The current study aims to provide [GOAL the basic information about the
fertilisers system, specially in its nutrient dynamics.] [OBJECT Long-term trends
of the soil's chemical and physical fertility] were also analysed. The methodology
is based on the [METHOD study of lands' plots using different histories of usage
of crop rotation with fertilisers] in order to detect long-term changes. ...
Finally, a deep checking of data allowed us to conclude that [CONCLUSION soils
have improved after 12 years of continuous rotation.]
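The way such markers can trigger role extraction may be sketched as follows. This is only an illustration under stated assumptions: the regular expressions, the names ROLE_PATTERNS and extract_roles, and the idea of one pattern per role are hypothetical and are not the IE patterns actually used in this approach.

```python
import re

# Hypothetical marker patterns (an assumption for illustration): each maps a
# rhetorical role to a regex whose named group "span" captures the text that
# the marker phrase introduces or follows.
ROLE_PATTERNS = {
    "goal": re.compile(r"The current study aims to provide (?P<span>[^.]+)\."),
    "object": re.compile(r"(?P<span>[^.]+?) were also analysed\."),
    "method": re.compile(r"The methodology is based on the (?P<span>.+?) in order to"),
    "conclusion": re.compile(r"allowed us to conclude that (?P<span>[^.]+)\."),
}

# The sample abstract from the text, including its elision ("...").
ABSTRACT = (
    "The current study aims to provide the basic information about the "
    "fertilisers system, specially in its nutrient dynamics. Long-term trends "
    "of the soil's chemical and physical fertility were also analysed. The "
    "methodology is based on the study of lands' plots using different "
    "histories of usage of crop rotation with fertilisers in order to detect "
    "long-term changes. ... Finally, a deep checking of data allowed us to "
    "conclude that soils have improved after 12 years of continuous rotation."
)


def extract_roles(text):
    """Return a dict mapping each role to the first span its marker triggers."""
    roles = {}
    for role, pattern in ROLE_PATTERNS.items():
        match = pattern.search(text)
        if match:
            roles[role] = match.group("span").strip()
    return roles


if __name__ == "__main__":
    for role, span in extract_roles(ABSTRACT).items():
        print(f"{role.upper()}: {span}")
```

Because the patterns refer only to genre-level marker phrases and not to fertiliser-specific vocabulary, the same table could in principle be reused on abstracts from another domain.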
From such a structure, important constituents can be identified:
Rhetorical Roles (discourse-level knowledge): these indicate important places
where the author makes some “assertions” about his/her work (i.e., the author
is stating the goals, the methods used, the conclusions reached, etc.). In the
example above, the roles are represented by goal, object of study, method and
conclusion.
Predicate Relations: these are represented by actions (a predicate and its arguments)
that are directly connected to the role being identified. Each states a relation
that holds between a set of terms (words that are part of a sentence), a
predicate and the role to which they are linked. For the example, they are
as follows: provide('the basic information ..'), analyse('long-term trends
...'), study('lands plot using ...'), improve('soil ..improved after ..')
Causal Relation(s): Although there are no explicit causal relations in the above
example, we can hypothesise a simple rule of the form:
IF the current goals are G1, G2, .. and the means/methods used are
M1, M2, .. (and any other constraint/feature) THEN it is true that
we can achieve the conclusions C1, C2, ..
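To make the three kinds of constituents concrete, the sketch below stores the predicate relations of the example together with their roles and assembles them into an IF/THEN rule of the hypothesised form. It is a minimal illustration and not the representation used in this approach: the class names PredicateRelation and Rule, the helpers build_rule and format_rule, and the printed layout are all assumptions. The elided arguments are kept exactly as they appear in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredicateRelation:
    role: str       # rhetorical role the predicate is linked to
    predicate: str  # the action
    argument: str   # the terms it applies to (elisions kept as in the text)


@dataclass(frozen=True)
class Rule:
    conditions: tuple   # relations for goals, objects and methods (IF part)
    conclusions: tuple  # relations for conclusions (THEN part)


# Predicate relations listed for the sample abstract.
RELATIONS = (
    PredicateRelation("goal", "provide", "the basic information .."),
    PredicateRelation("object", "analyse", "long-term trends ..."),
    PredicateRelation("method", "study", "lands plot using ..."),
    PredicateRelation("conclusion", "improve", "soil ..improved after .."),
)


def build_rule(relations):
    """Split relations into antecedent and consequent by rhetorical role."""
    conditions = tuple(r for r in relations if r.role != "conclusion")
    conclusions = tuple(r for r in relations if r.role == "conclusion")
    return Rule(conditions, conclusions)


def format_rule(rule):
    """Render a rule as text; this layout is an assumption for illustration."""
    def fmt(r):
        return f"{r.role}({r.predicate}({r.argument!r}))"
    lhs = " AND ".join(fmt(r) for r in rule.conditions)
    rhs = " AND ".join(fmt(r) for r in rule.conclusions)
    return f"IF {lhs} THEN {rhs}"


if __name__ == "__main__":
    print(format_rule(build_rule(RELATIONS)))
```

Keeping the role alongside each predicate relation means that the split into antecedent and consequent depends only on the genre-level roles and not on the domain vocabulary of the arguments.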
Finally, the sample abstract may be represented in a rule-like form as follows: