Interestingness (How interesting is the hypothesis in terms of its antecedent and consequent?):
Unlike other approaches to measure “interestingness” which use an external
resource (e.g., WordNet) and rely on its organisation, we propose a different
view where the criterion can be evaluated from the semi-structured information
provided by the LSA analysis. Accordingly, the measure for hypothesis H is
defined as a degree of unexpectedness as follows:
interestingness(H) = SemDissim(Antecedent(H), Consequent(H))
That is, the lower the similarity, the more interesting the hypothesis is likely
to be, so the dissimilarity is measured as the inverse of the LSA similarity.
A high similarity, by contrast, suggests that the hypothesis merely encodes a
correlation between its antecedent and consequent, which may be an uninteresting,
commonly known fact [26].
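As a minimal sketch, the measure can be computed from the LSA vectors of the antecedent and consequent. Here "inverse of the LSA similarity" is taken as 1 minus the cosine similarity; that reading, and the function names, are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def lsa_similarity(u, v):
    # Cosine similarity between two LSA vectors (the standard LSA measure).
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def interestingness(antecedent_vec, consequent_vec):
    # Degree of unexpectedness: the inverse of the LSA similarity.
    # "Inverse" is interpreted here as 1 - similarity (an assumption);
    # dissimilar antecedent/consequent pairs score close to 1.
    return 1.0 - lsa_similarity(antecedent_vec, consequent_vec)
```

Under this reading, orthogonal (fully dissimilar) vectors yield the maximum score of 1, while identical vectors yield 0.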
Coherence: This metric addresses the question of whether the elements of the
current hypothesis relate to each other in a semantically coherent way. Unlike
rules produced by DM techniques, in which the order of the conditions is not an
issue, the hypotheses produced in our model rely on pairs of adjacent elements
which should be semantically sound, a property which has long been dealt with
in the linguistic domain, in the context of text coherence [10].
As we have semantic information provided by the LSA analysis which is
complemented with rhetorical and predicate-level knowledge, we developed a simple
method to measure coherence, following work by [10] on measuring text coherence.
Semantic coherence is calculated by considering the average semantic similarity
between consecutive elements of the hypothesis. However, note that this closeness
is only computed on the semantic information that the predicates and their
arguments convey (i.e., not the roles), as the role structure has been considered
in a previous criterion. Accordingly, the criterion can be expressed as follows:
Coherence(H) = ( Σ_{i=1}^{|H|−1} SemSim(P_i(A_i), P_{i+1}(A_{i+1})) ) / (|H| − 1)
where (|H| − 1) denotes the number of adjacent pairs, and SemSim is the
LSA-based semantic similarity between two predicates.
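The averaging above can be sketched directly: given the LSA vectors of the hypothesis elements P_1(A_1)…P_n(A_n) in order, coherence is the mean cosine similarity over the |H| − 1 adjacent pairs. The cosine implementation of SemSim is an assumption consistent with standard LSA practice.

```python
import numpy as np

def sem_sim(u, v):
    # LSA-based semantic similarity between two elements P_i(A_i),
    # computed as the cosine of their LSA vectors (roles excluded).
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def coherence(element_vecs):
    # element_vecs: LSA vectors for the hypothesis elements, in order.
    # Coherence is the average similarity over adjacent pairs.
    sims = [sem_sim(u, v) for u, v in zip(element_vecs, element_vecs[1:])]
    return sum(sims) / len(sims)
```

A hypothesis whose consecutive elements all point in the same semantic direction scores 1; unrelated consecutive elements pull the average toward 0.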
Coverage: The coverage metric addresses the question of how well the
hypothesis is supported by the model (i.e., rules representing documents and
semantic information).
Coverage of a hypothesis has usually been measured in KDD approaches by
considering some structuring in data (i.e., discrete attributes) which is not
present in textual information. Moreover, most KDD approaches have assumed the
use of linguistic or conceptual resources to measure the degree of coverage of
the hypotheses (i.e., matching against databases or positive examples).
In order to deal with the criterion in the context of KDT, we say that a generated
hypothesis H covers an extracted rule R_i (i.e., a rule extracted from the original
training documents, including semantic and rhetorical information) only if the
predicates of H are roughly (or exactly, in the best case) contained in R_i.
Formally, the rules covered are defined as:
RulesCovered(H) = { R_i ∈ RuleSet | ∀ P_j ∈ R_i ∃ HP_k ∈ HP :
(SemSim(HP_k, P_j) ≥ threshold ∨ predicate(HP_k) = predicate(P_j)) }
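The set above can be sketched as follows. Each hypothesis predicate HP_k and rule predicate P_j is represented as a (name, vector) pair; a rule is covered when every one of its predicates has some hypothesis predicate that is roughly similar (SemSim at or above a threshold) or exactly the same predicate. The threshold value and the data representation are assumptions for illustration only.

```python
def rules_covered(hypothesis, rule_set, sem_sim, threshold=0.7):
    # hypothesis: list of (predicate_name, lsa_vector) pairs HP_k.
    # rule_set: list of rules; each rule R_i is a list of (name, vector) pairs P_j.
    # threshold (0.7) is a hypothetical value; the text does not fix one.
    covered = []
    for rule in rule_set:
        # A rule is covered if every predicate P_j is matched, roughly
        # (similarity >= threshold) or exactly (same predicate name),
        # by some hypothesis predicate HP_k.
        if all(any(sem_sim(hv, pv) >= threshold or hn == pn
                   for hn, hv in hypothesis)
               for pn, pv in rule):
            covered.append(rule)
    return covered
```

With an exact-match similarity stub, a rule sharing a predicate with the hypothesis is covered while an unrelated rule is not.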