IR and data analysis techniques) and text mining: the goal of information access
is to help users find documents that satisfy their information needs, whereas KDT
aims at discovering or deriving novel information from texts, finding patterns across
the documents [17]. Here, two main approaches can be distinguished: those based on
Bag-of-Words representations, and those based on more structured representations.
9.2.1 Bag-of-Words-Based Approaches
Some of the early work on TM came from the Information Retrieval community,
hence the assumption that text is represented as a Bag-of-Words (BOW) and then
processed via classical DM methods [7, 27]. Since a BOW keeps no information
beyond these keywords, and issues such as their order do not matter, it is
usually referred to as a non-structured representation.
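A minimal sketch of the BOW idea, assuming a naive lowercase tokenizer; the function name `bag_of_words` and the two sample sentences are illustrative and not taken from the cited work. It shows that two texts with opposite meanings collapse to the same representation once word order is discarded.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return term counts for a text; word order is discarded."""
    # Assumption: a crude tokenizer that keeps only alphabetic runs.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


doc_a = "smoking may cause stress"
doc_b = "stress may cause smoking"

# The two sentences assert opposite causal directions,
# yet their bags of words are identical.
print(bag_of_words(doc_a) == bag_of_words(doc_b))  # True
```

Any method built on top of these counts therefore sees the two sentences as the same document.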
Once the initial information (i.e., terms, keywords) has been extracted, KDD
operations can be carried out to discover unseen patterns. Representative methods
in this context have included Regular Associations [6], Concept Hierarchies
[Feldman98b], Full Text Mining [27], Clustering, and Self-Organising Maps.
Most of these approaches work in a very limited way because they rely on sur-
face information extracted from the texts, and on its statistical analysis. As a con-
sequence, key underlying linguistic information is lost. The systems may be able to
detect relations or associations between items, but they cannot provide any descrip-
tion of what those relations are. At this stage, it is the user's responsibility to look
for the documents involving those concepts and relations to find the answers.
Thus, the relations are just a “clue” that something interesting is present, and
they still need to be verified manually.
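The limitation described above can be illustrated with a small sketch of term association discovery based on document-level co-occurrence. This is a simplified stand-in for the association methods cited earlier, not a reproduction of any of them; the function `term_associations`, the `min_support` threshold, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def term_associations(docs, min_support=2):
    """Return (a, b, support, confidence) for term pairs that co-occur
    in at least min_support documents, where confidence = P(b | a)."""
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing each term pair
    for doc in docs:
        terms = sorted(set(doc))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    rules = []
    for (a, b), n in pair_df.items():
        if n >= min_support:
            rules.append((a, b, n, n / term_df[a]))
    return sorted(rules)


# Hypothetical keyword sets, one list per document.
docs = [
    ["merger", "bank", "shares"],
    ["merger", "bank"],
    ["bank", "loan"],
]

for a, b, support, conf in term_associations(docs):
    print(f"{a} -> {b}: support={support}, confidence={conf:.2f}")
```

The output reports that "bank" and "merger" tend to appear together, but nothing in it says what the relation between them is; that is the part left for the user to check in the underlying documents.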
9.2.2 High-Level Representation Approaches
Another main stream in KDT involves using more structured or higher-level
representations to perform deeper analysis so as to discover more sophisticated,
novel or interesting knowledge. In general, the different approaches have been
concerned with either performing exploratory analysis for hypothesis formation or
finding new connections/relations between previously analysed natural language
knowledge, but this stream has also involved using term-level knowledge for
purposes other than just statistical analysis.
Some early research by Swanson on the titles of articles stored in MEDLINE [28]
used an augmented low-level representation (the words in the titles) and exploratory
data analysis to discover hidden connections [30, 32], leading to very promising and
interesting results in terms of answering questions for which the answer was not
known at the time. He showed how chains of causal implication within the medical
literature can lead to hypotheses for causes of rare diseases, some of which have
received supporting scientific evidence.
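The idea of chaining causal implications can be sketched as a simple transitive search: if one source states that A leads to B and another that B leads to C, and no source links A to C directly, then A -> C is a candidate hypothesis. This is a toy illustration under that assumption, not Swanson's actual procedure; the function `candidate_hypotheses` and the placeholder terms are hypothetical.

```python
from collections import defaultdict


def candidate_hypotheses(links):
    """Given (cause, effect) pairs, return (a, b, c) triples where
    a -> b and b -> c are stated but a -> c is not."""
    effects = defaultdict(set)
    for cause, effect in links:
        effects[cause].add(effect)
    proposals = set()
    for a, mids in list(effects.items()):
        for b in mids:
            for c in effects.get(b, ()):
                # Skip cycles and links that are already stated directly.
                if c != a and c not in effects[a]:
                    proposals.add((a, b, c))
    return sorted(proposals)


# Placeholder links, as if each pair came from a different article title.
links = [
    ("factor_x", "effect_y"),
    ("effect_y", "condition_z"),
    ("factor_x", "effect_w"),
]

for a, b, c in candidate_hypotheses(links):
    print(f"hypothesis: {a} -> {c} (via {b})")
```

Each triple is only a hypothesis to be examined; whether it holds has to be established by evidence outside the text analysis.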
Other approaches using Information Extraction (IE), which inherited some of
Swanson's ideas to derive new patterns from a combination of text fragments, have
also been successful. Essentially, IE is a Natural-Language (NL) technology which
analyses an input NL document in a shallow way by using defined patterns along with
mechanisms to resolve implicit discourse-level information (e.g., anaphora and
coreference) to match important information from the texts. As a result, an IE task