In Experiment 1, we compared all eight systems in terms of overall quality score by applying each system to a database of self-explanation protocols produced by college students. The protocols had been evaluated for overall quality by a human expert. In Experiment 2, we investigated two systems using a database of explanations produced by middle-school students. These protocols were scored to identify particular reading strategies.
6.3.1 Experiment 1
Self-Explanations. The self-explanations were collected from college students who
were provided with SERT training and then tested with two texts, Thunderstorm
and Coal. Both texts consisted of 20 sentences. The Thunderstorm text was self-
explained by 36 students and the Coal text was self-explained by 38 students. The
self-explanations were coded by an expert according to the following 4-point scale: 0
= vague or irrelevant; 1 = sentence-focused (restatement or paraphrase of the sen-
tence); 2 = local-focused (includes concepts from immediately previous sentences);
3 = global-focused (using prior knowledge).
The coding system was intended to reveal the extent to which the participant
elaborated the current sentence. Sentence-focused explanations do not provide any
new information beyond the current sentence. Local-focused explanations might
include an elaboration of a concept mentioned in the current or immediately prior
sentence, but there is no attempt to link the current sentence to the theme of the
text. Self-explanations that linked the sentence to the theme of the text with world
knowledge were coded as “global-focused.” Global-focused explanations tend to use
multiple reading strategies, and indicate the most active level of processing.
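For concreteness, here is a minimal sketch of how this 4-point coding scheme might be represented programmatically; the codes and labels are taken from the scale above, while all identifiers are hypothetical:

```python
from collections import Counter

# The codes and labels below come directly from the scale above;
# the variable and function names are hypothetical.
SE_STRATEGY_SCALE = {
    0: "vague or irrelevant",
    1: "sentence-focused (restates or paraphrases the current sentence)",
    2: "local-focused (elaborates concepts from immediately previous sentences)",
    3: "global-focused (links the sentence to the text's theme via prior knowledge)",
}

def code_distribution(codes):
    """Tally how often each strategy level occurs in a set of ratings."""
    counts = Counter(codes)
    return {SE_STRATEGY_SCALE[level]: counts.get(level, 0)
            for level in SE_STRATEGY_SCALE}
```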
Results. Each of the eight systems produces an evaluation comparable to the human ratings on a 4-point scale. Hence, we calculated the correlations and percent agreement between the human and system evaluations (see Table 6.2). Additionally, d-primes (d′s) were computed for each strategy level as a measure of how well the system could discriminate among the different levels of strategy use. The d′s were computed from hit and false-alarm rates. A hit would occur if the system assigned a self-explanation to the same category (e.g., global-focused) as the human judges. A false alarm would occur if the system assigned a self-explanation to a category (e.g., global-focused) that differed from the human judges' rating (i.e., it was not actually a global-focused strategy). d′s are highest when hits are high and false alarms are low. In this context, d′s express the correspondence between the human and system ratings in standard deviation units. A d′ of 0 indicates chance performance, whereas greater d′s indicate greater correspondence.
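To make the computation concrete, the sketch below derives the correlation, percent agreement, and per-level d′ from paired human and system ratings, using the standard signal-detection formula d′ = z(hit rate) − z(false-alarm rate). The function name and the clipping correction are assumptions for illustration, not details taken from the study:

```python
import numpy as np
from scipy.stats import norm, pearsonr

def evaluate_system(human, system, levels=(0, 1, 2, 3)):
    """Compare system ratings to human ratings on the 0-3 strategy scale.

    Returns the Pearson correlation, the proportion of exact agreement,
    and a per-level d' computed from hit and false-alarm rates.
    """
    human = np.asarray(human)
    system = np.asarray(system)

    r, _ = pearsonr(human, system)               # correlation between raters
    agreement = float(np.mean(human == system))  # percent exact agreement

    d_primes = {}
    for level in levels:
        is_level = human == level        # protocols the human coded at this level
        said_level = system == level     # protocols the system coded at this level
        hit_rate = said_level[is_level].mean()   # system agrees with human: hit
        fa_rate = said_level[~is_level].mean()   # system says level, human did not
        # Clip rates away from 0 and 1 so the z-transform stays finite
        # (a common correction; the study's exact handling is not stated).
        hit_rate = float(np.clip(hit_rate, 0.01, 0.99))
        fa_rate = float(np.clip(fa_rate, 0.01, 0.99))
        d_primes[level] = norm.ppf(hit_rate) - norm.ppf(fa_rate)

    return r, agreement, d_primes
```

Under this definition, equal hit and false-alarm rates give d′ = 0, matching the chance-level interpretation above, and d′ grows as hits rise relative to false alarms.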
One thing to note in Table 6.3 is that there is general improvement on all of the measures going from left to right. As might be expected, the systems with LSA fared far better than those without LSA, and the combined systems were the most successful. The word-based systems tended to perform worse as the evaluation level increased (from 0 to 3), but performed relatively well at identifying poor self-explanations and paraphrases. All of the systems, however, identified the local-focused (i.e., 2's) explanations less successfully. Nevertheless, the d′s for the local-focused explanations approach 1.0 when LSA is incorporated, particularly when LSA is combined with the word-based algorithms.
Apart from performing better with LSA than without, the systems also perform more stably with LSA. Whereas the word-based systems did not perform equally