Evolving Explanatory Novel Patterns for Semantically-Based Text Mining - Natural Language Processing and Text Mining

have much predictive power by themselves, the system can effectively create “derived

attributes” with greater predictive power) to come up with new rules.

A common representation used for this kind of task encodes attributes and values

of a rule in a binary string of rule conditions and rule consequent. Suppose that an

individual represents a rule antecedent with a single attribute-value condition, where

the attribute Marital status can take the values “single,” “married,” “divorced,”

and “widow.” A possible representation would be a condition involving this attribute

encoded by four bits, so the string “0110” (i.e., the second and third values of the

attribute are present) would represent the antecedent (IF marital status = married

OR divorced) using internal disjunctions (i.e., logical OR).
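
The four-bit encoding described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (decode_condition, encode_condition) and the ordering of the values are assumptions made for the example, not part of any particular GA system.

```python
# Illustrative sketch: one bit per attribute value, in a fixed order.
# The value order below is an assumption taken from the example in the text.
MARITAL_STATUS_VALUES = ["single", "married", "divorced", "widow"]


def decode_condition(attribute, values, bits):
    """Turn a bit string into a readable condition with internal disjunction."""
    if len(bits) != len(values) or set(bits) - {"0", "1"}:
        raise ValueError("bits must be a 0/1 string with one bit per value")
    # A bit set to 1 means the corresponding value is present in the condition.
    selected = [value for value, bit in zip(values, bits) if bit == "1"]
    if not selected:
        return None  # no value selected: the condition is empty
    return f"{attribute} = " + " OR ".join(selected)


def encode_condition(values, selected):
    """Turn a set of selected values back into the bit string."""
    return "".join("1" if value in selected else "0" for value in values)


antecedent = decode_condition("marital status", MARITAL_STATUS_VALUES, "0110")
print("IF", antecedent)
```

With this layout, bit-level genetic operators act directly on the rule: flipping a single bit adds or removes one value from the disjunction.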

One general aspect worth noting in applying GAs for DM tasks is that both the

representation used for the discovery and the evaluation carried out assume that the

source data are properly represented in a structured form (i.e., database) in which

the attributes and values are easily handled.

When dealing with text data, these working assumptions are not always plausible

because of the complexity of text information. In particular, mining text data

using evolutionary algorithms requires a certain level of representation which captures

knowledge beyond discrete data (i.e., semantics). Thus there arises the need for

new operations to create knowledge from text databases. In addition, fitness evaluation

also imposes important challenges in terms of measuring novel and interesting

knowledge which might be implicit in the texts or be embedded in the underlying

semantics of the extracted data.

Applying evolutionary methods to TM/KDT is a very recent research topic.

With the exception of the work of [1] on the discovery of semantic relations, no other

research effort is under way as far as we know, since the most promising KDT techniques

have been tackled with more traditional search/learning methods.

The advantage of this approach over a similar one for the discovery of unseen relations,

as in [16], is that it provides more robust results by exploring a wider range of

possible hypotheses in the search space. In addition, the IE patterns finally

used for the extraction are automatically learned, whereas for [16], these need to be

handcrafted. Although the obtained relations have been evaluated in terms of their

coverage in WordNet, the subjective quality of this unseen knowledge has not been

assessed from a KDD viewpoint as no user has been involved in the process.

9.3 A Semantically Guided Model for Effective Text Mining

We developed a semantically guided model for evolutionary Text Mining which

is domain-independent but genre-based. Unlike previous approaches to KDT, our

approach does not rely on external resources or descriptions, hence its

domain independence. Instead, it performs the discovery using only information from the

original corpus of text documents and from the training data generated from them.

In addition, a number of strategies have been developed for automatically evaluating

the quality of the hypotheses (“novel” patterns). This is an important contribution

on a topic which has been neglected in most KDT research in recent years.

We have adopted GAs as central to our approach to KDT. However, for proper

GA-based KDT there are important issues to be addressed including representa-
