IF
goal(provide('the basic information ..'))
AND object(analyse('long-term trends ...'))
AND method(study('lands plot using ...'))
THEN
conclusion(improve('soil ..improved after ..'))
Note that causal relations are extracted from individual abstracts. To extract this initial key information from the texts, an IE module was built. Essentially, it takes a set of text documents, tags them with a previously trained Part-of-Speech (POS) tagger (i.e., the Brill tagger), and produces an intermediate representation for every document (i.e., a template, in the IE sense), which is then converted into a general rule. A set of hand-crafted, domain-independent extraction patterns was written and coded.
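As a rough illustration of such extraction patterns, the following Python sketch matches a few cue phrases against an abstract and fills a small template with predicate and argument strings. The patterns, role names and sample abstract are illustrative assumptions only; the actual hand-crafted patterns, and the POS-tagged input they operate on, are not reproduced here.

import re

# Illustrative cue-phrase patterns for three rhetorical roles; the real
# hand-crafted patterns (and the tagged input they work on) differ.
PATTERNS = {
    "goal":       re.compile(r"\b(?:aims? to|in order to)\s+(\w+)\s+(.+?)[.;]", re.I),
    "method":     re.compile(r"\b(?:by|using|through)\s+(\w+ing)\s+(.+?)[.;]", re.I),
    "conclusion": re.compile(r"\b(?:results show that|we conclude that)\s+(.+?)[.;]", re.I),
}

def extract_template(abstract):
    """Fill a crude template: role -> groups captured by its pattern."""
    template = {}
    for role, pattern in PATTERNS.items():
        match = pattern.search(abstract)
        if match:
            template[role] = match.groups()
    return template

abstract = ("The study aims to provide basic information on soil quality; "
            "by studying land plots over ten years; "
            "results show that soil structure improved after treatment.")
print(extract_template(abstract))
# {'goal': ('provide', 'basic information on soil quality'), ...}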
In addition, key training data are captured from the corpus of documents itself
and from the semantic information contained in the rules. This can guide the discov-
ery process in making further similarity judgements and assessing the plausibility of
the produced hypotheses.
Training Information from the Corpus:
It has been suggested that large amounts of text represent a valuable source of semantic knowledge. In particular, Latent Semantic Analysis (LSA) [21] claims that this knowledge lies at the word level.
Following work by [21] on LSA incorporating structure, we have designed a semi-structured LSA representation for text data in which we represent predicate information (i.e., verbs) and arguments (i.e., sets of terms) separately once they have been extracted in the IE phase. Similarity is then calculated by computing the closeness between two predicates (and their arguments) based on the LSA data (function SemSim(P1(A1), P2(A2))).
We propose a simple strategy for representing the meaning of the predicates
with arguments. Next, a simple method is developed to measure the similarity
between these units.
Given a predicate P and its argument A, the vectors representing the meaning of both can be extracted directly from the training information provided by the LSA analysis. The argument is represented by summing the vectors of its terms and then averaging them, as is usually done in semi-structured LSA. Once this is done, the meaning vector of the predicate-argument pair is obtained by summing the two vectors, as in [33]. If there is more than one argument, the final argument vector is simply the sum of the individual arguments' vectors.
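A minimal sketch of this construction is given below, assuming a toy term-to-vector lookup (lsa_vectors) in place of a real LSA space; the function names and the three-dimensional vectors are illustrative only.

import numpy as np

# Toy term -> LSA-vector lookup; real vectors would come from the SVD of a
# term-document matrix and have far more dimensions.
lsa_vectors = {
    "provide":     np.array([0.9, 0.1, 0.0]),
    "basic":       np.array([0.2, 0.7, 0.1]),
    "information": np.array([0.3, 0.6, 0.2]),
}

def argument_vector(terms, vectors):
    """Average the LSA vectors of the argument's terms (semi-structured LSA)."""
    return np.mean([vectors[t] for t in terms], axis=0)

def meaning_vector(predicate, arguments, vectors):
    """Sum the predicate vector and the vector of each argument."""
    v = vectors[predicate].copy()
    for arg_terms in arguments:
        v += argument_vector(arg_terms, vectors)
    return v

v1 = meaning_vector("provide", [["basic", "information"]], lsa_vectors)
print(v1)   # predicate vector plus the averaged argument vector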
Next, to make further semantic similarity judgements between two predicates P1(A1) and P2(A2) (e.g., provide('the basic information ..')), we take their previously calculated meaning vectors and determine how close these two vectors are. We can evaluate this by computing the cosine between the vectors, which gives a closeness measure between -1 (complete unrelatedness) and 1 (complete relatedness) [22].
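The following sketch illustrates this closeness measure; only the cosine formula follows the text, while sem_sim, v1 and v2 are illustrative placeholders rather than real LSA output.

import numpy as np

def sem_sim(v1, v2):
    """Cosine between two meaning vectors: -1 (unrelated) .. 1 (related)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

v1 = np.array([1.15, 0.75, 0.15])   # placeholder meaning vector for P1(A1)
v2 = np.array([0.80, 1.10, 0.50])   # placeholder meaning vector for P2(A2)
print(round(sem_sim(v1, v2), 3))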
Note, however, that training information from the texts is not sufficient on its own, as it only conveys data at the level of word semantics. We claim that basic knowledge at a rhetorical and semantic level, together with co-occurrence information, can be effectively computed to feed the discovery process and to guide the GA.
Accordingly, we perform two kinds of tasks: creating the initial population and
computing training information from the rules.