The indiGo Project: Enhancement of Experience Management and Process Learning with Moderated Discourses - Advances in Data Mining

text classification, there are promising approaches that represent different learning

paradigms. Among them, support vector machines (SVM) are one of the most

promising solutions (Joachims 1998). AIS has successfully applied SVM to different

classification problems - topic detection and author identification (Kindermann,

Diederich et al. 2002), multi-class classification (Kindermann, Paaß & Leopold 2001)

- on different linguistic corpora: Reuters newswire, English and German newspapers

(Leopold & Kindermann 2002), as well as radio broadcasts (Eickeler, Kindermann

et al. 2002). The major problem of applying text classification techniques in the

indiGo project is the amount of data. Training an SVM requires several hundred

positive and negative examples for each class to be considered. These data must be

collected in the group discussions. The contributions in a discussion group have to

be annotated with respect to the desired classes by the moderator.
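
As a rough illustration of the annotate-then-train workflow described above, the following sketch trains a linear SVM on bag-of-words features with a Pegasos-style subgradient method. It is not the classifier used in the indiGo project: the toy contributions, the +1/-1 labels, and all function names are hypothetical, and a realistic setup would need the several hundred annotated examples per class mentioned above.

```python
from collections import Counter


def bag_of_words(text):
    """Lower-cased term counts; a deliberately naive whitespace tokenizer."""
    return Counter(text.lower().split())


def dot(weights, features):
    return sum(weights.get(term, 0.0) * count for term, count in features.items())


def train_linear_svm(examples, lam=0.01, epochs=50):
    """Pegasos-style subgradient descent on the regularised hinge loss.

    examples: list of (text, label) pairs, label +1 or -1, standing in for
    contributions the moderator has annotated for one class.
    """
    data = [(bag_of_words(text), label) for text, label in examples]
    weights = {}
    t = 0
    for _ in range(epochs):
        for features, label in data:  # fixed order keeps the sketch deterministic
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            margin = label * dot(weights, features)
            # Regularisation step: 1 - eta * lam simplifies to 1 - 1/t.
            scale = 1.0 - 1.0 / t
            for term in weights:
                weights[term] *= scale
            # Hinge-loss step: only examples inside the margin move the weights.
            if margin < 1.0:
                for term, count in features.items():
                    weights[term] = weights.get(term, 0.0) + eta * label * count
    return weights


def predict(weights, text):
    return 1 if dot(weights, bag_of_words(text)) > 0 else -1


# Hypothetical annotated contributions: +1 = belongs to the class, -1 = does not.
annotated = [
    ("requirements changed late and testing slipped", 1),
    ("when is our next meeting", -1),
    ("lesson learned review design before coding", 1),
    ("thanks see you tomorrow", -1),
]

weights = train_linear_svm(annotated)
```

With several classes, one such binary classifier per class (one-vs-rest) would be a common arrangement, which is why the number of annotated examples needed grows with the number of classes.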

An especially challenging task for text mining systems is to map unstructured

natural-language text to a structured internal representation (basically a set of data objects).

indiGo requires mapping the text documents generated in the group discussions to

structured information about project experiences. However, the limited scope of the

indiGo project - many roles can only be filled by a finite number of subjects (e.g.

the number of IESE's employees or customers is finite) - makes it possible to devise

simplifying solutions to many problems, solutions that would not be feasible in the general case.
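
One way to read the simplification described above: if the people or organisations that can fill a role come from a known, finite list, mapping a contribution to a structured record can start with a plain lookup rather than general-purpose information extraction. The sketch below assumes such lists are available. The role names, the example names, and the `ProjectExperience` record are hypothetical and are not taken from the indiGo system.

```python
from dataclasses import dataclass, field
import re

# Hypothetical closed lists; in practice they might come from a staff or
# customer directory.
KNOWN_ROLES = {
    "employee": {"Anna Schmidt", "Ben Weber"},
    "customer": {"Acme GmbH"},
}


@dataclass
class ProjectExperience:
    text: str
    participants: dict = field(default_factory=dict)  # role -> sorted list of names


def extract_experience(text):
    """Fill role slots by case-insensitive whole-word matching against the closed lists."""
    record = ProjectExperience(text=text)
    for role, names in KNOWN_ROLES.items():
        found = [
            name
            for name in sorted(names)
            if re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE)
        ]
        if found:
            record.participants[role] = found
    return record


record = extract_experience("Anna Schmidt agreed the new deadline with Acme GmbH.")
```

The lookup only works because the lists are closed. Names outside them are silently ignored, which is exactly the restriction that does not hold in the general case.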

The context of an utterance consists of all elements in a communicative situation

that determine the understanding of an utterance in a systematic way. Context is divided

into verbal and non-verbal context (Bußmann 1990). Non-verbal context cannot -

or at best to a small extent - be conveyed in written text. Abstracting away from the

non-verbal context of the situation in which a text (spoken or written) is produced

means that the lost information has to be replaced by linguistic means in order to

avoid misunderstandings. This is why spoken

and written language differ. Speaker and hearer are exposed to the same contextual

situation, which disambiguates their utterances, whereas writer and reader - in the

traditional sense of the word - are not.

Computer-mediated communication adopts an intermediate position in this respect.

Writer and reader react to each other's utterances as speaker and hearer do. They are

in the same communicative situation. But their opportunity to convey non-verbal

information is limited, as is their chance to obtain information about the contextual

situations of their counterparts.

The context of the communicative situation becomes crucial in the indiGo setting

when discussions are condensed to project experiences. The communicative situation

of the discussion is lost, and the corresponding information has to be added to the natural

language data. This limits the degree of information compaction of linguistic data.

Consequently, the decontextualization suggested in Figure 1 has to be performed

carefully so as not to end up with compressed but nevertheless meaningless "structured

information". How and to what extent information about the communicative situation

can be concentrated or discarded is an interesting research objective of the indiGo

project.

To provide the moderator with information about the problem-orientation of the

participants in a discussion we propose an “index of speciality of language”, which

can be calculated on the basis of the agreement of the vocabulary of writer and reader.
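
The text does not give a formula for this index. As a minimal sketch of what "agreement of the vocabulary" could mean, the following code computes the Jaccard overlap between the word types used by a writer and by a reader. The choice of Jaccard overlap, the tokenization, and the function names are assumptions made for illustration only, not the definition used in indiGo.

```python
import re


def vocabulary(texts):
    """Set of lower-cased word types used in a collection of contributions."""
    words = set()
    for text in texts:
        words.update(re.findall(r"\w+", text.lower()))
    return words


def vocabulary_agreement(writer_texts, reader_texts):
    """Jaccard overlap of the two vocabularies, between 0.0 and 1.0 (assumed measure)."""
    writer_vocab = vocabulary(writer_texts)
    reader_vocab = vocabulary(reader_texts)
    union = writer_vocab | reader_vocab
    if not union:
        return 0.0
    return len(writer_vocab & reader_vocab) / len(union)


agreement = vocabulary_agreement(
    ["the build server failed"],
    ["the server was slow"],
)
```

Under this assumption, a low agreement value would mean that the writer uses many terms the reader does not, which is one conceivable signal of specialised language for the moderator to look at.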

Self-organizing maps (SOM) (Kohonen 2001; Merkl 1997) can give an overview
