Evolving Explanatory Novel Patterns for Semantically-Based Text Mining - Natural Language Processing and Text Mining

IR and data analysis techniques) and text mining: the goal of information access

is to help users find documents that satisfy their information needs, whereas KDT

aims at discovering or deriving novel information from texts, finding patterns across

the documents [17]. Here, two main approaches can be distinguished: those based on

Bag-of-Words representations, and those based on more structured representations.

9.2.1 Bag-of-Words-Based Approaches

Some of the early work on TM came from the Information Retrieval community, hence the assumption that text is represented as a Bag-of-Words (BOW) and then processed via classical DM methods [7, 27]. Since a BOW approach ignores any information beyond these keywords, including issues such as their order, it is usually referred to as a non-structured representation.
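To make the loss of word order concrete, the following is a minimal sketch of a BOW representation. The function name `bag_of_words` and the simple letter-based tokeniser are assumptions made for illustration, not part of any system cited here.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each lower-cased term to its frequency; word order is discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Two sentences with opposite meanings but the same set of words.
bow_a = bag_of_words("dog bites man")
bow_b = bag_of_words("man bites dog")

# Under a BOW representation the two texts are indistinguishable.
print(bow_a == bow_b)
```

Both sentences yield the same term counts, which is why any relation that depends on who did what to whom is invisible to methods built on this representation.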

Once the initial information (i.e., terms, keywords) has been extracted, KDD operations can be carried out to discover unseen patterns. Representative methods in this context have included Regular Associations [6], Concept Hierarchies [Feldman98b], Full Text Mining [27], Clustering, and Self-Organising Maps.
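As an illustration of a KDD operation over extracted keywords, the sketch below derives simple pairwise association rules using support and confidence. It is a generic, assumed formulation rather than the specific method of [6]; the function name `keyword_associations`, the thresholds, and the toy keyword sets are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def keyword_associations(docs, min_support=0.5, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) between keyword pairs.

    docs is a list of keyword collections, one per document.
    """
    n = len(docs)
    item_counts = Counter()
    pair_counts = Counter()
    for keywords in docs:
        kws = set(keywords)
        item_counts.update(kws)
        # Sorting gives each unordered pair a single canonical form.
        pair_counts.update(combinations(sorted(kws), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of documents containing both terms
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Toy keyword sets, invented for illustration only.
docs = [
    {"merger", "bank", "shares"},
    {"merger", "bank"},
    {"bank", "rates"},
    {"merger", "shares"},
]
for lhs, rhs, support, confidence in keyword_associations(docs):
    print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

Each rule only states that two keywords tend to appear together; it carries no description of how the underlying concepts are related.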

Most of these approaches work in a very limited way because they rely on surface information extracted from the texts, and on its statistical analysis. As a consequence, key underlying linguistic information is lost. The systems may be able to detect relations or associations between items, but they cannot provide any description of what those relations are. At this stage, it is the user's responsibility to look for the documents involved with those concepts and relations to find the answers. Thus, the relations are just a “clue” that there is something interesting, which still needs to be manually verified.

9.2.2 High-Level Representation Approaches

Another main stream in KDT involves using more structured or higher-level representations to perform deeper analysis so as to discover more sophisticated novel or interesting knowledge. In general, the different approaches have been concerned with either performing exploratory analysis for hypothesis formation or finding new connections/relations between previously analysed natural language knowledge; some have also used term-level knowledge for purposes other than just statistical analysis.

Some early research by Swanson on the titles of articles stored in MEDLINE [28] used an augmented low-level representation (the words in the titles) and exploratory data analysis to discover hidden connections [30, 32], leading to very promising and interesting results in terms of answering questions for which the answer was not currently known. He showed how chains of causal implication within the medical literature can lead to hypotheses for causes of rare diseases, some of which have received supporting scientific evidence.
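The idea of chaining implications across titles can be sketched as a transitive co-occurrence search: if term A appears with B in one title and B appears with C in another, while A and C never appear together, the pair (A, C) becomes a candidate hypothesis. This is a simplified illustration under stated assumptions, not Swanson's actual procedure; the function name `hidden_connections`, the fixed vocabulary, and the two toy titles (loosely modelled on the frequently cited fish oil and Raynaud's syndrome example) are invented for the sketch.

```python
from collections import defaultdict
from itertools import combinations


def hidden_connections(titles, vocabulary):
    """Return {(a, c): [bridging terms]} for terms linked only indirectly."""
    cooccur = defaultdict(set)
    for title in titles:
        lowered = title.lower()
        # Naive substring matching stands in for real term recognition.
        present = [term for term in vocabulary if term in lowered]
        for a, b in combinations(present, 2):
            cooccur[a].add(b)
            cooccur[b].add(a)

    hypotheses = {}
    for a, c in combinations(sorted(vocabulary), 2):
        if c in cooccur[a]:
            continue  # already directly connected in some title
        bridges = cooccur[a] & cooccur[c]
        if bridges:
            hypotheses[(a, c)] = sorted(bridges)
    return hypotheses


# Toy titles, invented for illustration only.
titles = [
    "Fish oil lowers blood viscosity",
    "Blood viscosity is raised in Raynaud's syndrome",
]
vocabulary = ["fish oil", "blood viscosity", "raynaud"]

for (a, c), bridges in hidden_connections(titles, vocabulary).items():
    print(f"{a} ?-> {c} via {bridges}")
```

The output is only a candidate connection between two terms that no single title mentions together; whether it reflects a real causal chain still has to be checked against the literature.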

Other approaches using Information Extraction (IE), which inherited some of Swanson's ideas to derive new patterns from a combination of text fragments, have also been successful. Essentially, IE is a Natural-Language (NL) technology which analyses an input NL document in a shallow way by using defined patterns along with mechanisms to resolve implicit discourse-level information (e.g., anaphora, coreference) to match important information from the texts. As a result, an IE task
