Interestingness (How interesting is the hypothesis in terms of its antecedent and consequent?):
Unlike other approaches to measure “interestingness” which use an external
resource (e.g., WordNet) and rely on its organisation, we propose a different
view where the criterion can be evaluated from the semi-structured information
provided by the LSA analysis. Accordingly, the measure for hypothesis H is
defined as a degree of unexpectedness as follows:
interestingness(H) = SemDissim(Antecedent(H), Consequent(H))
That is, the lower the similarity, the more interesting the hypothesis is likely
to be, so the dissimilarity is measured as the inverse of the LSA similarity.
A high similarity, by contrast, suggests that the hypothesis merely encodes a
correlation between its antecedent and consequent, which may be an uninteresting,
commonly known fact [26].
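As a minimal sketch, the measure can be computed from the LSA vectors of the antecedent and consequent. Here "inverse of the LSA similarity" is taken as 1 minus the cosine similarity; that reading, and the function names, are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def lsa_similarity(u, v):
    # Cosine similarity between two LSA vectors (the standard LSA measure).
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def interestingness(antecedent_vec, consequent_vec):
    # Degree of unexpectedness: the inverse of the LSA similarity.
    # "Inverse" is interpreted here as 1 - similarity (an assumption);
    # dissimilar antecedent/consequent pairs score close to 1.
    return 1.0 - lsa_similarity(antecedent_vec, consequent_vec)
```

Under this reading, orthogonal (fully dissimilar) vectors yield the maximum score of 1, while identical vectors yield 0.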
Coherence: This metric addresses the question of whether the elements of the
current hypothesis relate to each other in a semantically coherent way. Unlike
rules produced by DM techniques, in which the order of the conditions is not an
issue, the hypotheses produced in our model rely on pairs of adjacent elements
which should be semantically sound, a property which has long been dealt with
in the linguistic domain, in the context of text coherence [10].
As we have semantic information provided by the LSA analysis which is
complemented with rhetorical and predicate-level knowledge, we developed a simple
method to measure coherence, following work by [10] on measuring text coherence.
Semantic coherence is calculated by considering the average semantic similarity
between consecutive elements of the hypothesis. However, note that this closeness
is only computed on the semantic information that the predicates and their
arguments convey (i.e., not the roles), as the role structure has been considered
in a previous criterion. Accordingly, the criterion can be expressed as follows:
Coherence(H) = ( Σ_{i=1}^{|H|−1} SemSim(P_i(A_i), P_{i+1}(A_{i+1})) ) / (|H| − 1)
where (|H| − 1) denotes the number of adjacent pairs, and SemSim is the
LSA-based semantic similarity between two predicates.
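The averaging above can be sketched directly: given the LSA vectors of the hypothesis elements P_1(A_1)…P_n(A_n) in order, coherence is the mean cosine similarity over the |H| − 1 adjacent pairs. The cosine implementation of SemSim is an assumption consistent with standard LSA practice.

```python
import numpy as np

def sem_sim(u, v):
    # LSA-based semantic similarity between two elements P_i(A_i),
    # computed as the cosine of their LSA vectors (roles excluded).
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def coherence(element_vecs):
    # element_vecs: LSA vectors for the hypothesis elements, in order.
    # Coherence is the average similarity over adjacent pairs.
    sims = [sem_sim(u, v) for u, v in zip(element_vecs, element_vecs[1:])]
    return sum(sims) / len(sims)
```

A hypothesis whose consecutive elements all point in the same semantic direction scores 1; unrelated consecutive elements pull the average toward 0.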
Coverage: The coverage metric addresses the question of how well the
hypothesis is supported by the model (i.e., rules representing documents and
semantic information).
Coverage of a hypothesis has usually been measured in KDD approaches by
considering some structuring in data (i.e., discrete attributes) which is not
present in textual information. Moreover, most KDD approaches have assumed the
use of linguistic or conceptual resources to measure the degree of coverage of
the hypotheses (i.e., matching against databases or positive examples).
In order to deal with the criterion in the context of KDT, we say that a generated
hypothesis H covers an extracted rule R_i (i.e., a rule extracted from the original
training documents, including semantic and rhetorical information) only if the
predicates of H are roughly (or exactly, in the best case) contained in R_i.
Formally, the rules covered are defined as:
RulesCovered(H) = { R_i ∈ RuleSet | ∀ P_j ∈ R_i ∃ HP_k ∈ HP :
(SemSim(HP_k, P_j) ≥ threshold ∨ predicate(HP_k) = predicate(P_j)) }
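The set above can be sketched as follows. Each hypothesis predicate HP_k and rule predicate P_j is represented as a (name, vector) pair; a rule is covered when every one of its predicates has some hypothesis predicate that is roughly similar (SemSim at or above a threshold) or exactly the same predicate. The threshold value and the data representation are assumptions for illustration only.

```python
def rules_covered(hypothesis, rule_set, sem_sim, threshold=0.7):
    # hypothesis: list of (predicate_name, lsa_vector) pairs HP_k.
    # rule_set: list of rules; each rule R_i is a list of (name, vector) pairs P_j.
    # threshold (0.7) is a hypothetical value; the text does not fix one.
    covered = []
    for rule in rule_set:
        # A rule is covered if every predicate P_j is matched, roughly
        # (similarity >= threshold) or exactly (same predicate name),
        # by some hypothesis predicate HP_k.
        if all(any(sem_sim(hv, pv) >= threshold or hn == pn
                   for hn, hv in hypothesis)
               for pn, pv in rule):
            covered.append(rule)
    return covered
```

With an exact-match similarity stub, a rule sharing a predicate with the hypothesis is covered while an unrelated rule is not.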