Evolving Explanatory Novel Patterns for Semantically-Based Text Mining - Natural Language Processing and Text Mining


keywords), for taking advantage of linguistic knowledge, and for special purpose

ways of producing and assessing the unseen knowledge. The rest of the effort has

concentrated on doing text mining from an Information Retrieval (IR) perspective

and so both the representation (keyword-based) and the data analysis are restricted.

The most sophisticated approaches to text mining or KDT are characterised by

an intensive use of external electronic resources including ontologies, thesauri, etc.,

which severely restricts both the applicability of the patterns to be discovered and

their domain independence. In addition, the systems so produced have few metrics

(or none at all) which allow them to establish whether the patterns are interesting

and novel.

In terms of data mining techniques, Genetic Algorithms (GAs) for mining purposes

have several promising advantages over the usual learning/analysis methods

employed in KDT: the ability to perform global search (traditional approaches deal

with predefined patterns and restricted scope), the exploration of solutions in parallel,

the robustness to cope with noisy and missing data (something critical in

dealing with text information as partial text analysis techniques may lead to imprecise

outcome data), and the ability to assess the goodness of the solutions as they

are produced.
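
The properties listed above can be illustrated with a minimal sketch of a generational GA. It is not the model proposed in this chapter: the bit-string encoding, the toy fitness function (`sum`, i.e., the number of ones), and all parameter values are assumptions chosen only to keep the example small. It shows a population of candidate solutions explored together and every candidate being scored as soon as it is produced.

```python
import random

def evolve(fitness, n_bits=20, pop_size=30, generations=40, seed=0):
    """Toy generational GA over bit strings (illustrative assumptions only)."""
    rng = random.Random(seed)
    # A whole population of candidates is explored at once.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        # Every candidate is assessed as it is produced.
        scored = sorted(pop, key=fitness, reverse=True)
        history.append(fitness(scored[0]))
        next_pop = [scored[0][:]]  # elitism: carry the best individual over
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with a small per-bit probability.
            child = [b ^ 1 if rng.random() < 0.02 else b for b in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness), history

best, history = evolve(sum)
```

Because the best individual is always carried over, the values in `history` never decrease; in a text mining setting the fitness function would instead score candidate hypotheses.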

In this paper, we propose a new model for KDT which brings together the

benefits of shallow text processing and GAs to produce effective novel knowledge.

In particular, the approach combines Information Extraction (IE) technology and

multi-objective evolutionary computation techniques. It aims at extracting key underlying

linguistic knowledge from text documents (i.e., rhetorical and semantic

information) and then hypothesising and assessing interesting and unseen explanatory

knowledge. Unlike other approaches to KDT, we do not use additional electronic

resources or domain knowledge beyond the text database.
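
As a hedged illustration of what multi-objective assessment involves, the sketch below keeps the candidates that are not Pareto-dominated by any other candidate. The two objectives and the score tuples are hypothetical; the actual evaluation criteria of the model are not defined in this passage.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one
    (all objectives are assumed to be maximised)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Return the score vectors that no other vector dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]

# Hypothetical (objective_1, objective_2) scores for five candidate hypotheses.
hypotheses = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
front = pareto_front(hypotheses)
```

Here `(0.4, 0.4)` and `(0.1, 0.1)` are dropped because `(0.5, 0.5)` is better on both objectives, while the remaining three represent different trade-offs and are all retained.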

9.2 Related Work

Typical approaches to text mining and knowledge discovery from texts are based on

simple bag-of-words (BOW) representations of texts which make it easy to analyse

them but restrict the kind of discovered knowledge [2]. Furthermore, the discoveries

rely on patterns in the form of numerical associations between concepts from the

documents (these terms will later be referred to as target concepts), which fails

to provide explanations of, for example, why these terms show a strong connection.

Consequently, no deeper knowledge or evaluation of the discovered knowledge is

considered and so the techniques become merely “adaptations” of traditional DM

methods with an unproven effectiveness from a user viewpoint.
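
The limitation described above can be made concrete with a small sketch. The three toy documents and the whitespace tokenisation are assumptions made for illustration only. Two sentences asserting opposite causal directions yield the same bag of words, and the co-occurrence count between two terms says nothing about why they are connected.

```python
from collections import Counter
from itertools import combinations

def bag_of_words(text):
    """Lower-case, whitespace-tokenised term counts; word order is discarded."""
    return Counter(text.lower().split())

# Toy documents (hypothetical examples).
docs = [
    "stress causes disease",
    "disease causes stress",
    "exercise reduces stress",
]
bags = [bag_of_words(d) for d in docs]

# Opposite claims, identical representation.
same_representation = bags[0] == bags[1]

# Document-level co-occurrence counts: a purely numerical association.
cooccurrence = Counter()
for bag in bags:
    for pair in combinations(sorted(bag), 2):
        cooccurrence[pair] += 1

strength = cooccurrence[("disease", "stress")]
```

The pair `("disease", "stress")` co-occurs in two documents, yet the representation cannot tell which term is claimed to cause the other.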

Traditional approaches to KDT share many characteristics with classical DM but

they also differ in many ways: many classical DM algorithms [19, 6] are irrelevant

or ill suited for textual applications as they rely on the structuring of data and

the availability of large amounts of structured information [7, 18, 27]. Many KDT

techniques inherit traditional DM methods and keyword-based representation, which

are insufficient to cope with the rich information contained in natural-language text.

In addition, it is still unclear how to rate the novelty and/or interestingness of the

knowledge discovered from texts.

Some people suggest that inadequacy and failure to report novel results are likely

because of the confusion between finding/accessing information in texts (i.e., using
