text classification, there are promising approaches that stand for different learning
paradigms. Among them, support vector machines (SVM) are one of the most
promising solutions (Joachims 1998). AIS has successfully applied SVM to different
classification problems - topic detection and author identification (Kindermann,
Diederich et al. 2002), multi-class classification (Kindermann, Paaß & Leopold 2001)
- on different linguistic corpora: Reuters newswire, English and German newspapers
(Leopold & Kindermann 2002), as well as radio broadcasts (Eickeler, Kindermann
et al. 2002). The major problem of applying text classification techniques in the
indiGo project is the amount of training data required. Training an SVM requires
several hundred positive and negative examples for each class to be considered.
These data must be collected in the group discussions, and the moderator has to
annotate the contributions in a discussion group with respect to the desired classes.
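The data requirement can be made concrete with a small check over the moderator's annotations. The following is a minimal sketch, not part of indiGo: it assumes that each contribution carries exactly one class label, that every contribution of another class counts as a negative example (a one-vs-rest reading), and that 100 is the minimum number of examples. The function names and the threshold are hypothetical.

```python
from collections import Counter


def one_vs_rest_counts(annotations):
    """Return {label: (n_positive, n_negative)} for single-label annotations.

    Assumption: contributions annotated with any other label serve as
    negative examples for a given class.
    """
    per_class = Counter(annotations)
    total = len(annotations)
    return {label: (n, total - n) for label, n in per_class.items()}


def classes_lacking_data(annotations, min_examples=100):
    """List classes whose positive or negative count is below min_examples.

    min_examples is an illustrative threshold, not a value from the project.
    """
    counts = one_vs_rest_counts(annotations)
    return sorted(
        label
        for label, (positives, negatives) in counts.items()
        if positives < min_examples or negatives < min_examples
    )
```

Such a check would tell the moderator which classes still need more annotated contributions before training is worthwhile.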
An especially challenging task for text mining systems is to map unstructured
natural text to a structured internal representation (basically a set of data objects).
indiGo requires mapping text documents generated in the group discussions to
structured information about project experiences. However, the limited scope of the
indiGo project - many roles can only be filled by a finite number of subjects (e.g.
the number of IESE's employees or customers is finite) - makes it possible to devise
simplifying solutions to many problems, solutions that would not be feasible in the
general case.
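One simplification that a finite set of subjects permits is a closed-list lookup: instead of recognizing arbitrary names, a system only checks a contribution against the names it already knows. The sketch below illustrates this idea under stated assumptions; it is not the method used in indiGo, and the function name and the example names are hypothetical.

```python
import re


def find_known_subjects(text, known_names):
    """Return the known names that occur in text as whole words.

    Matching is case-insensitive; results follow the order of known_names.
    """
    found = []
    for name in known_names:
        # \b anchors keep "Acme" from matching inside a longer word.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


# Hypothetical closed list of employees and customers.
known = ["Anna Schmidt", "Acme", "Bob Meyer"]
mentions = find_known_subjects("Anna Schmidt reported the delay to Acme.", known)
```

A lookup of this kind cannot handle names outside the list, which is exactly why it is only feasible in a setting of limited scope.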
The context of an utterance consists of all elements in a communicative situation
that determine the understanding of an utterance in a systematic way. Context divides
into verbal and non-verbal context (Bußmann 1990). Non-verbal context cannot -
or at best to a small extent - be conveyed in written text. Abstracting away from the
non-verbal context of the situation in which a text (spoken or written) is produced
means that the lost information has to be substituted by linguistic means in order to
avoid misunderstandings. This is why spoken
and written language differ. Speaker and hearer are exposed to the same contextual
situation, which disambiguates their utterances, whereas writer and reader - in the
traditional sense of the word - are not.
Computer-mediated communication adopts an intermediate position in this respect.
Writer and reader react to each other's utterances as speaker and hearer do. They are
in the same communicative situation. But their opportunity to convey non-verbal
information is limited, as is their chance to obtain information about the contextual
situation of their counterparts.
The context of the communicative situation becomes crucial in the indiGo setting
when discussions are condensed to project experiences. The communicative situation
of the discussion is lost, and the corresponding information has to be added to the
natural language data. This limits the degree to which linguistic data can be compacted.
Consequently, the decontextualization suggested in Figure 1 has to be performed
carefully so as not to end up with compressed but nevertheless senseless "structured
information". How and to what extent information about the communicative situation
can be concentrated or discarded is an interesting research objective of the indiGo
project.
To provide the moderator with information about the problem-orientation of the
participants in a discussion we propose an “index of speciality of language”, which
can be calculated on the basis of the agreement of the vocabulary of writer and reader.
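
The text does not give a formula for this index. As a minimal sketch, vocabulary agreement could be measured as the Jaccard overlap of the word sets used by writer and reader. The choice of Jaccard overlap, the simple tokenization, and the function names are assumptions made for illustration only; how such a score would translate into "speciality" is left open here.

```python
import re


def vocabulary(text):
    """Return the set of lower-cased word tokens in text."""
    return set(re.findall(r"\w+", text.lower()))


def vocabulary_agreement(writer_text, reader_text):
    """Jaccard overlap of two vocabularies, between 0.0 and 1.0.

    Assumption: agreement is the shared vocabulary divided by the
    combined vocabulary; two empty texts yield 0.0.
    """
    writer_vocab = vocabulary(writer_text)
    reader_vocab = vocabulary(reader_text)
    combined = writer_vocab | reader_vocab
    if not combined:
        return 0.0
    return len(writer_vocab & reader_vocab) / len(combined)


score = vocabulary_agreement("the build fails", "the build works")
```

In this example the two texts share two of four distinct words, so the score is 0.5.
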
Self-organizing maps (SOM) (Kohonen 2001; Merkl 1997) can give an overview