have much predictive power by themselves, the system can effectively create “derived
attributes” with greater predictive power) to come up with new rules.
A common representation used for this kind of task encodes the attributes and values
of a rule in a binary string holding the rule conditions and the rule consequent. Suppose
that an individual represents a rule antecedent with a single attribute-value condition,
where the attribute is Marital status and its values can be “single,” “married,” “divorced,”
and “widow.” A possible representation would encode a condition involving this attribute
in four bits, one per value, so the string “0110” (i.e., the second and third values of the
attribute are present) would represent the antecedent (IF marital status = married
OR divorced), using internal disjunction (i.e., logical OR).
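The four-bit encoding described above can be illustrated with a minimal Python sketch. The function names (decode_condition, condition_matches) and the handling of an all-zero string are assumptions made here for illustration; they are not taken from any of the cited systems.

```python
from typing import Sequence

# Values of the Marital status attribute, in the order listed in the text.
MARITAL_STATUS_VALUES = ("single", "married", "divorced", "widow")


def _check_bits(values: Sequence[str], bits: str) -> None:
    """Ensure the bit string has exactly one 0/1 bit per attribute value."""
    if len(bits) != len(values) or any(b not in "01" for b in bits):
        raise ValueError(f"expected {len(values)} bits of 0/1, got {bits!r}")


def decode_condition(attribute: str, values: Sequence[str], bits: str) -> str:
    """Turn a bit string into a readable antecedent with internal disjunction."""
    _check_bits(values, bits)
    selected = [value for value, bit in zip(values, bits) if bit == "1"]
    if not selected:
        # Assumption: an all-zero string selects no value, so it is rejected here.
        raise ValueError("bit string selects no value")
    return f"IF {attribute} = " + " OR ".join(selected)


def condition_matches(values: Sequence[str], bits: str, observed: str) -> bool:
    """Check whether an observed attribute value satisfies the encoded condition."""
    _check_bits(values, bits)
    return any(bit == "1" and value == observed for value, bit in zip(values, bits))


if __name__ == "__main__":
    print(decode_condition("marital status", MARITAL_STATUS_VALUES, "0110"))
    print(condition_matches(MARITAL_STATUS_VALUES, "0110", "single"))
```

With one bit per value, a condition can be tested against a record by a simple positional lookup, and genetic operators can act on the string without knowing what the values mean.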
One general aspect worth noting in applying GAs for DM tasks is that both the
representation used for the discovery and the evaluation carried out assume that the
source data are properly represented in a structured form (i.e., database) in which
the attributes and values are easily handled.
When dealing with text data, these working assumptions are not always plausible
because of the complexity of text information. In particular, mining text data
using evolutionary algorithms requires a certain level of representation which captures
knowledge beyond discrete data (i.e., semantics). Thus there arises the need for
new operations to create knowledge from text databases. In addition, fitness evaluation
also imposes important challenges in terms of measuring novel and interesting
knowledge which might be implicit in the texts or be embedded in the underlying
semantics of the extracted data.
Applying evolutionary methods to TM/KDT is a very recent research topic.
With the exception of the work of [1] on the discovery of semantic relations, no other
research effort is under way as far as we know, as the most promising KDT techniques
have been tackled with more traditional search/learning methods.
The advantage over a similar approach to the discovery of unseen relations, as in
[16], is that this approach provides more robust results by considering a wider
range of possible hypotheses in the search space. In addition, the IE patterns finally
used for the extraction are learned automatically, whereas in [16] they need to be
handcrafted. Although the obtained relations have been evaluated in terms of their
coverage in WordNet, the subjective quality of this unseen knowledge has not been
assessed from a KDD viewpoint, as no user has been involved in the process.
9.3 A Semantically Guided Model for Effective Text Mining
We developed a semantically guided model for evolutionary Text Mining which
is domain-independent but genre-based. Unlike previous approaches to KDT, our
approach does not rely on external resources or descriptions, hence its domain
independence. Instead, it performs the discovery using only information from the
original corpus of text documents and from the training data generated from them.
In addition, a number of strategies have been developed for automatically evaluating
the quality of the hypotheses (“novel” patterns). This is an important contribution
on a topic which has been neglected in most KDT research in recent years.
We have adopted GAs as central to our approach to KDT. However, for proper
GA-based KDT there are important issues to be addressed including representa-
 