4 Mining Diagnostic Text Reports by Learning to Annotate Knowledge Roles
Eni Mustafaraj, Martin Hoof, and Bernd Freisleben
4.1 Introduction
Several text mining tasks, such as text categorization, document clustering, and
information retrieval, operate at the document level and make use of the so-called
bag-of-words model. Other tasks, such as document summarization, information
extraction, and question answering, have to operate at the sentence level in order
to fulfill their specific requirements. While both groups of tasks are typically
affected by the problem of data sparsity, the problem is more pronounced for the
latter group. Thus, while the tasks of the first group can be tackled by statistical
and machine learning methods based on a bag-of-words approach alone, the tasks of
the second group need natural language processing (NLP) at the sentence or
paragraph level in order to produce more informative features.
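To illustrate why a bag-of-words representation can be too coarse for sentence-level tasks, here is a minimal sketch in Python. The tokenizer and the two example sentences are assumptions made for illustration and are not taken from the chapter's data. Because the model keeps only word counts, two sentences with opposite meanings can map to the same representation.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, keep alphabetic tokens, and count them.

    Word order is discarded, which is the defining property of the
    bag-of-words model.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Two hypothetical diagnostic sentences that state opposite findings.
s1 = "The insulation is damaged, the winding is not."
s2 = "The winding is damaged, the insulation is not."

bow1 = bag_of_words(s1)
bow2 = bag_of_words(s2)

# The sentences differ, but their bags of words are identical.
print(s1 == s2)      # False
print(bow1 == bow2)  # True
```

At the document level this loss of order is often tolerable, since many words contribute to the representation. At the sentence level, telling the two findings apart needs features that capture structure, such as which noun is the subject of "damaged".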
Another issue common to all of the previously mentioned tasks is the availability of
labeled data for training. For documents in real-world text mining projects,
training data usually do not exist or are expensive to acquire. To still meet the
text mining goals while using only a small amount of labeled data, several
approaches in machine learning have been developed and tested: different types of
active learning [16], bootstrapping [13], or a combination of labeled and unlabeled
data [1]. Thus, the issue of the lack of labeled data turns into the issue of selecting
an appropriate machine learning approach.
The nature of the text mining task as well as the quantity and quality of available
text data are other issues that need to be considered. While some text mining
approaches can cope with data noise by leveraging the redundancy and the large
quantity of available documents (for example, information retrieval on the Web), for
other tasks (typically those restricted to a single domain) the collection of documents
might not possess such qualities. Therefore, more care is required for preparing such
documents for the text mining task.
The previous observations suggest that performing a text mining task on new
and unknown data requires handling all of the above-mentioned issues by combining
and adopting different research approaches. In this chapter, we present an approach
to extracting knowledge from text documents containing diagnostic problem-solving
situations in a technical domain (i.e., electrical engineering). In the proposed
approach, we have combined techniques from several areas, including NLP, knowledge