itself (How well is the hypothesis supported by the initial text documents? How interesting is it?). Accordingly, we have defined eight evaluation criteria to assess the hypotheses (i.e., in terms of Pareto dominance, this produces an 8-dimensional vector of objective functions): relevance, structure, cohesion, interestingness, coherence, coverage, simplicity, and plausibility of origin.
The current hypothesis to be assessed will be denoted by H, and the training rules by R_i. The evaluation criteria by which the hypotheses are assessed, and the questions they aim to address, are as follows:
Relevance
Relevance addresses the issue of how important the hypothesis is to the target concepts. As previously described, this involves two concepts (i.e., terms), related to the question:

What is the best set of hypotheses that explains the relation between <term 1> and <term 2>?

For the current hypothesis, this turns into a more specific question: how well does the hypothesis explain this relation?
This can be estimated by determining the semantic closeness between the hypothesis' predicates (and their arguments) and the target concepts² using the meaning vectors obtained from the LSA analysis for both terms and predicates.
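As a point of reference, here is a minimal sketch of this closeness computation, under the common assumption that semantic similarity (SemSim) is the cosine between LSA meaning vectors; the function name sem_sim and the NumPy representation are illustrative choices, not the authors' implementation:

```python
import numpy as np

def sem_sim(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two LSA meaning vectors.

    `u` and `v` are assumed to be rows of the reduced term space
    produced by the LSA (truncated SVD) step.
    """
    denom = float(np.linalg.norm(u) * np.linalg.norm(v))
    return float(u @ v) / denom if denom > 0.0 else 0.0
```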
Our method for assessing relevance takes these issues into account, along with some ideas from Kintsch's Predication. Specifically, as part of the relevance measure we use the concept of Strength [21], strength(A, I) = f(SemSim(A, I), SemSim(P, I)), between a predicate with arguments and the surrounding concepts (the target concepts, in our case). It essentially decides whether the predicate (and its argument) is relevant to the target concepts, in terms of the similarity between both the predicate and the argument, and the concepts.
We define the function f, as proposed by [21], to give a relatedness measure such that high values are obtained only if both the similarity between the target concept and the argument (α) and that between the target concept and the predicate (β) exceed some threshold. Next, we capture closeness by taking the squared difference between each similarity value and the desired value (1.0). Averaging these squared differences yields an error metric, the Mean Square Error (MSE). Since we want low error values to reward high closeness, we subtract the MSE from 1. Formally, f(α, β) is therefore computed as:
\[
f(\alpha, \beta) =
\begin{cases}
1 - \mathrm{MSE}(\{\alpha, \beta\}) & \text{if both } \alpha \text{ and } \beta > threshold \\
0 & \text{otherwise}
\end{cases}
\]
where MSE, the Mean Square Error between the similarities and the desired value (V_d = 1.0), is calculated as:

\[
\mathrm{MSE}(\{v_1, \ldots, v_n\}) = \frac{1}{n} \sum_{i=1}^{n} (v_i - V_d)^2
\]
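A small sketch of these two functions follows, building on the sem_sim sketch above; the default threshold of 0.5 is a placeholder assumption, as the text does not fix its value here:

```python
import numpy as np

def mse(values, desired: float = 1.0) -> float:
    """Mean square error of the similarity values against the desired value V_d = 1.0."""
    v = np.asarray(values, dtype=float)
    return float(np.mean((v - desired) ** 2))

def f(alpha: float, beta: float, threshold: float = 0.5) -> float:
    """Relatedness measure: non-zero only when both the argument similarity
    (alpha) and the predicate similarity (beta) exceed the threshold."""
    if alpha > threshold and beta > threshold:
        return 1.0 - mse([alpha, beta])
    return 0.0
```

For instance, f(0.9, 0.8) = 1 − (0.1² + 0.2²)/2 = 0.975, whereas f(0.9, 0.3) = 0 because one similarity falls below the threshold.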
To account for both target concepts, we simply average the strength over both terms, so the overall relevance becomes:
\[
\mathrm{relevance}(H) = \frac{1}{2\,|H|} \sum_{i=1}^{|H|} \big[\mathrm{strength}(P_i, A_i, \langle term_1 \rangle) + \mathrm{strength}(P_i, A_i, \langle term_2 \rangle)\big]
\]
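Putting the pieces together, a minimal sketch of the full relevance computation, reusing sem_sim and f from the sketches above; representing a hypothesis as a list of (predicate vector, argument vector) pairs is our illustrative choice, not the paper's data structure:

```python
def strength(pred_vec, arg_vec, concept_vec, threshold: float = 0.5) -> float:
    """strength(A, I) = f(SemSim(A, I), SemSim(P, I)) for one predicate-argument pair."""
    return f(sem_sim(arg_vec, concept_vec),
             sem_sim(pred_vec, concept_vec),
             threshold)

def relevance(hypothesis, term1_vec, term2_vec) -> float:
    """Average strength of H's predicate-argument pairs against both target concepts."""
    total = sum(strength(p, a, term1_vec) + strength(p, a, term2_vec)
                for p, a in hypothesis)
    return total / (2 * len(hypothesis))
```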
² Target concepts are relevant nouns in our experiment. In the general case, however, they might be either nouns or verbs.