Mining Diagnostic Text Reports by Learning to Annotate Knowledge Roles - Natural Language Processing and Text Mining

4 Mining Diagnostic Text Reports by Learning to Annotate Knowledge Roles

Eni Mustafaraj, Martin Hoof, and Bernd Freisleben

4.1 Introduction

Several tasks approached with text mining techniques, such as text categorization, document clustering, or information retrieval, operate at the document level and make use of the so-called bag-of-words model. Other tasks, such as document summarization, information extraction, or question answering, have to operate at the sentence level in order to fulfill their specific requirements. While both groups of text mining tasks are typically affected by the problem of data sparsity, the problem is more pronounced for the latter group. Thus, while the tasks of the first group can be tackled by statistical and machine learning methods based on a bag-of-words approach alone, the tasks of the second group need natural language processing (NLP) at the sentence or paragraph level in order to produce more informative features.
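To illustrate why a bag-of-words representation alone can be too coarse for sentence-level tasks, the following minimal sketch builds word-count vectors for two sentences. The tokenizer, the function name `bag_of_words`, and the example sentences are assumptions made for illustration only; they are not taken from the chapter.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to its word counts, discarding word order.

    Hypothetical helper: a deliberately simple tokenizer that keeps
    only runs of ASCII letters after lowercasing.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Two made-up sentences that contain the same words in a different order
# and therefore describe different situations.
s1 = "the fault caused the discharge"
s2 = "the discharge caused the fault"

# Both sentences collapse to the same bag of words.
print(bag_of_words(s1) == bag_of_words(s2))  # prints True
```

Because both sentences yield identical counts, a model that sees only these counts cannot tell which event caused which. This is the kind of distinction that sentence-level NLP features are meant to recover.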

Another issue common to all of the previously mentioned tasks is the availability of labeled data for training. Usually, for documents in real-world text mining projects, training data do not exist or are expensive to acquire. To still meet the text mining goals with only a small amount of labeled data, several approaches in machine learning have been developed and tested: different types of active learning [16], bootstrapping [13], or a combination of labeled and unlabeled data [1]. Thus, the issue of the lack of labeled data turns into the issue of selecting an appropriate machine learning approach.
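As a rough illustration of one such approach, the sketch below shows uncertainty sampling, a common selection rule in active learning: the learner asks a human to label the instances about which its current binary classifier is least certain. The function name `most_uncertain` and the probability values are hypothetical, and the chapter does not state that this particular rule is the one used in [16].

```python
def most_uncertain(probabilities, k=1):
    """Return the indices of the k instances whose predicted
    positive-class probability lies closest to 0.5.

    Hypothetical helper for a binary classifier; ties keep their
    original order because Python's sort is stable.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Made-up classifier outputs for five unlabeled sentences.
scores = [0.95, 0.52, 0.10, 0.40, 0.80]

# Instances 1 and 3 are closest to 0.5, so they would be sent for labeling.
print(most_uncertain(scores, k=2))  # prints [1, 3]
```

The design choice here is to spend the limited annotation budget on instances near the decision boundary, where a new label is most likely to change the model.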

The nature of the text mining task, as well as the quantity and quality of the available text data, are other issues that need to be considered. While some text mining approaches can cope with data noise by leveraging the redundancy and the large quantity of available documents (for example, information retrieval on the Web), for other tasks (typically those restricted to a single domain) the collection of documents might not possess such qualities. Therefore, more care is required in preparing such documents for the text mining task.

The previous observations suggest that performing a text mining task on new and unknown data requires handling all of the above-mentioned issues by combining and adopting different research approaches. In this chapter, we present an approach to extracting knowledge from text documents containing diagnostic problem-solving situations in a technical domain (i.e., electrical engineering). In the proposed approach, we have combined techniques from several areas, including NLP, knowledge
