In Experiment 1, we compared all eight systems in terms of overall quality score by applying each system to a database of self-explanation protocols produced by college students. The protocols had been evaluated for overall quality by a human expert. In Experiment 2, we investigated two systems using a database of explanations produced by middle-school students. These protocols were scored to identify particular reading strategies.
6.3.1 Experiment 1
Self-Explanations. The self-explanations were collected from college students who
were provided with SERT training and then tested with two texts, Thunderstorm
and Coal. Both texts consisted of 20 sentences. The Thunderstorm text was self-
explained by 36 students and the Coal text was self-explained by 38 students. The
self-explanations were coded by an expert according to the following 4-point scale: 0
= vague or irrelevant; 1 = sentence-focused (restatement or paraphrase of the sen-
tence); 2 = local-focused (includes concepts from immediately previous sentences);
3 = global-focused (using prior knowledge).
The coding system was intended to reveal the extent to which the participant
elaborated the current sentence. Sentence-focused explanations do not provide any
new information beyond the current sentence. Local-focused explanations might
include an elaboration of a concept mentioned in the current or immediately prior
sentence, but there is no attempt to link the current sentence to the theme of the
text. Self-explanations that linked the sentence to the theme of the text with world
knowledge were coded as “global-focused.” Global-focused explanations tend to use
multiple reading strategies, and indicate the most active level of processing.
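For concreteness, here is a minimal sketch of how this 4-point coding scheme might be represented programmatically; the codes and labels are taken from the scale above, while all identifiers are hypothetical:

```python
from collections import Counter

# The codes and labels below come directly from the scale above;
# the variable and function names are hypothetical.
SE_STRATEGY_SCALE = {
    0: "vague or irrelevant",
    1: "sentence-focused (restates or paraphrases the current sentence)",
    2: "local-focused (elaborates concepts from immediately previous sentences)",
    3: "global-focused (links the sentence to the text's theme via prior knowledge)",
}

def code_distribution(codes):
    """Tally how often each strategy level occurs in a set of ratings."""
    counts = Counter(codes)
    return {SE_STRATEGY_SCALE[level]: counts.get(level, 0)
            for level in SE_STRATEGY_SCALE}
```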
Results. Each of the eight systems produces an evaluation comparable to the human ratings on a 4-point scale. Hence, we calculated the correlations and percent agreement between the human and system evaluations (see Table 6.2). Additionally, d-primes (d′s) were computed for each strategy level as a measure of how well the system could discriminate among the different levels of strategy use. The d′s were computed from hit and false-alarm rates. A hit would occur if the system assigned a self-explanation to the same category (e.g., global-focused) as the human judges. A false alarm would occur if the system assigned a self-explanation to a category (e.g., global-focused) that differed from the human judges' rating (i.e., it was not actually a global-focused strategy). d′s are highest when hits are high and false alarms are low. In this context, d′s express the correspondence between the human and system ratings in standard deviation units. A d′ of 0 indicates chance performance, whereas greater d′s indicate greater correspondence.
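To make the computation concrete, the sketch below derives the correlation, percent agreement, and per-level d′ from paired human and system ratings, using the standard signal-detection formula d′ = z(hit rate) − z(false-alarm rate). The function name and the clipping correction are assumptions for illustration, not details taken from the study:

```python
import numpy as np
from scipy.stats import norm, pearsonr

def evaluate_system(human, system, levels=(0, 1, 2, 3)):
    """Compare system ratings to human ratings on the 0-3 strategy scale.

    Returns the Pearson correlation, the proportion of exact agreement,
    and a per-level d' computed from hit and false-alarm rates.
    """
    human = np.asarray(human)
    system = np.asarray(system)

    r, _ = pearsonr(human, system)               # correlation between raters
    agreement = float(np.mean(human == system))  # percent exact agreement

    d_primes = {}
    for level in levels:
        is_level = human == level        # protocols the human coded at this level
        said_level = system == level     # protocols the system coded at this level
        hit_rate = said_level[is_level].mean()   # system agrees with human: hit
        fa_rate = said_level[~is_level].mean()   # system says level, human did not
        # Clip rates away from 0 and 1 so the z-transform stays finite
        # (a common correction; the study's exact handling is not stated).
        hit_rate = float(np.clip(hit_rate, 0.01, 0.99))
        fa_rate = float(np.clip(fa_rate, 0.01, 0.99))
        d_primes[level] = norm.ppf(hit_rate) - norm.ppf(fa_rate)

    return r, agreement, d_primes
```

Under this definition, equal hit and false-alarm rates give d′ = 0, matching the chance-level interpretation above, and d′ grows as hits rise relative to false alarms.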
One thing to note in Table 6.3 is that there is general improvement on all of the measures going from left to right. As might be expected, the systems with LSA fared far better than those without LSA, and the combined systems were the most successful. The word-based systems tended to perform worse as the evaluation level increased (from 0 to 3), but performed relatively well at identifying poor self-explanations and paraphrases. All of the systems, however, identified the local-focused (i.e., 2's) explanations less successfully. Nevertheless, the d′s for the local-focused explanations approach 1.0 when LSA is incorporated, particularly when LSA is combined with the word-based algorithms.
Apart from performing better with LSA than without, the systems also perform more stably with LSA. Whereas the word-based systems did not perform equally