keywords), for taking advantage of linguistic knowledge, and for special purpose
ways of producing and assessing the unseen knowledge. The rest of the effort has
concentrated on doing text mining from an Information Retrieval (IR) perspective
and so both representation (keyword based) and data analysis are restricted.
The most sophisticated approaches to text mining or KDT are characterised by
an intensive use of external electronic resources including ontologies, thesauri, etc.,
which severely restricts both the applicability of the unseen patterns to be
discovered and their domain independence. In addition, the systems so produced have few metrics
(or none at all) which allow them to establish whether the patterns are interesting
and novel.
In terms of data mining techniques, Genetic Algorithms (GAs) for mining purposes
have several promising advantages over the usual learning/analysis methods
employed in KDT: the ability to perform global search (traditional approaches deal
with predefined patterns and restricted scope), the exploration of solutions in
parallel, the robustness to cope with noisy and missing data (something critical in
dealing with text information, as partial text analysis techniques may lead to
imprecise outcome data), and the ability to assess the goodness of the solutions as
they are produced.
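To make these properties concrete, the following is a minimal sketch of a generic GA loop, not the model proposed in this chapter. The fitness function (a toy "count the ones" objective), the bit-string encoding, and all parameter values are assumptions chosen purely for illustration. The sketch shows a population of candidate solutions being explored together and the goodness of each candidate being assessed as it is produced.

```python
import random

def one_max(bits):
    """Toy fitness (an assumption for illustration): number of 1-bits."""
    return sum(bits)

def run_ga(fitness, n_bits=20, pop_size=30, generations=40, p_mut=0.02, seed=0):
    """Generic GA sketch: tournament selection, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    # A whole population of candidates is maintained and explored together.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []  # best fitness observed at the start of each generation
    for _ in range(generations):
        # Every candidate is scored as soon as it exists.
        scored = sorted(pop, key=fitness, reverse=True)
        history.append(fitness(scored[0]))
        next_pop = [scored[0][:]]  # elitism: carry the best candidate over unchanged
        while len(next_pop) < pop_size:
            # Tournament selection: best of three randomly drawn candidates.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randint(1, n_bits - 1)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < p_mut else b for b in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = run_ga(one_max)
print(one_max(best), history[0], history[-1])
```

Because the best candidate is copied unchanged into each new generation, the recorded best fitness never decreases. A KDT system would replace the toy objective with measures of the quality of a candidate hypothesis.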
In this paper, we propose a new model for KDT which brings together the
benefits of shallow text processing and GAs to produce effective novel knowledge.
In particular, the approach combines Information Extraction (IE) technology and
multi-objective evolutionary computation techniques. It aims at extracting key
underlying linguistic knowledge from text documents (i.e., rhetorical and semantic
information) and then hypothesising and assessing interesting and unseen
explanatory knowledge. Unlike other approaches to KDT, we do not use additional electronic
resources or domain knowledge beyond the text database.
9.2 Related Work
Typical approaches to text mining and knowledge discovery from texts are based on
simple bag-of-words (BOW) representations of texts which make it easy to analyse
them but restrict the kind of discovered knowledge [2]. Furthermore, the discoveries
rely on patterns in the form of numerical associations between concepts from the
documents (these terms will later be referred to as target concepts), which fail
to provide explanations of, for example, why these terms show a strong connection.
Consequently, no deeper knowledge or evaluation of the discovered knowledge is
considered and so the techniques become merely “adaptations” of traditional DM
methods with an unproven effectiveness from a user viewpoint.
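The limitation described above can be illustrated with a small sketch. It assumes whitespace tokenisation, a four-document toy corpus, and pointwise mutual information (PMI) as the association measure. None of these come from the approaches cited here, and the function names are hypothetical. The sketch shows that a BOW pipeline yields a number linking two terms but nothing that explains the link.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase and split on whitespace; word order and structure are discarded."""
    return Counter(text.lower().split())

def cooccurrence_pmi(docs, term_a, term_b):
    """PMI (base 2) of two terms, using document-level co-occurrence counts."""
    n = len(docs)
    bags = [bag_of_words(d) for d in docs]
    n_a = sum(1 for b in bags if term_a in b)
    n_b = sum(1 for b in bags if term_b in b)
    n_ab = sum(1 for b in bags if term_a in b and term_b in b)
    if n_ab == 0:
        return float("-inf")  # the terms never appear in the same document
    return math.log2((n_ab / n) / ((n_a / n) * (n_b / n)))

# Toy corpus (an assumption for illustration only).
docs = [
    "nitrogen fertiliser increases crop yield",
    "crop yield depends on nitrogen levels",
    "soil erosion reduces arable land",
    "erosion control protects soil",
]

# A strong numerical association is reported, but the representation has already
# lost the fact that the first document states a causal relation ("increases").
print(cooccurrence_pmi(docs, "nitrogen", "yield"))
```

The score says that "nitrogen" and "yield" tend to occur together. Why they do so is only recoverable from the sentence structure that the BOW step removed.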
Traditional approaches to KDT share many characteristics with classical DM but
they also differ in many ways: many classical DM algorithms [19, 6] are irrelevant
or ill suited for textual applications as they rely on the structuring of data and
the availability of large amounts of structured information [7, 18, 27]. Many KDT
techniques inherit traditional DM methods and keyword-based representation which
are insufficient to cope with the rich information contained in natural-language text.
In addition, it is still unclear how to rate the novelty and/or interestingness of the
knowledge discovered from texts.
Some people suggest that this inadequacy and the failure to report novel results are likely
due to the confusion between finding/accessing information in texts (i.e., using