TABLE 5.6. Hypercritic demonstration study results in contingency table format.

                         Pooled rating by judges
Hypercritic              Comment valid    Comment not valid    Total
                         (≥5 judges)      (<5 judges)
Comment generated        145              24                   169
Comment not generated    55               74                   129
Total                    200              98                   298

additional analyses therefore would be performed. Contingency table
methods, discussed in detail in Chapter 8, might be used to look at the
number and nature of the disagreements between the resource and the
judges. Such an analysis would require the investigators to choose a
threshold level corresponding to the number of judges' endorsements a
comment would require in order to be considered correct. Doing this reduces
the level of measurement of correctness from interval to ordinal and, as
such, results in some loss of information. (All comments coded as “correct”
are considered to be equally correct, and all comments coded as “incorrect”
are considered to be equally incorrect.) The original authors of the study
used endorsement of a comment by five or more judges as a criterion for
overall correctness. Using this same criterion, the data may be mapped into
a contingency table as shown in Table 5.6.
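
The mapping from endorsement counts to contingency table cells can be sketched in Python. This is a minimal illustration under stated assumptions, not the original authors' procedure: the function name `tabulate`, the record layout, and the six toy records are hypothetical. Only the default threshold of five endorsements comes from the text.

```python
from collections import Counter

def tabulate(records, threshold=5):
    """Cross-tabulate generated/not generated against valid/not valid.

    records: iterable of (generated, endorsements) pairs, where
    `generated` is True if the resource produced the comment and
    `endorsements` is the number of judges who endorsed it.
    """
    counts = Counter()
    for generated, endorsements in records:
        valid = endorsements >= threshold  # dichotomize the judges' count
        counts[(generated, valid)] += 1
    return {
        "generated_valid": counts[(True, True)],
        "generated_not_valid": counts[(True, False)],
        "not_generated_valid": counts[(False, True)],
        "not_generated_not_valid": counts[(False, False)],
    }

# Hypothetical toy records, NOT data from the study.
toy_records = [(True, 8), (True, 5), (True, 2), (False, 6), (False, 4), (False, 0)]
toy_table = tabulate(toy_records)
print(toy_table)
```

In the toy data, a comment endorsed by five judges and one endorsed by eight land in the same cell, which is the loss of information described above. Calling `tabulate(toy_records, threshold=7)` moves comments between cells, so the table depends on the threshold chosen.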
The contingency table analysis illustrated in Table 5.6 is useful because
it shows that two different kinds of errors occur in roughly equal
proportion, if endorsement by five or more judges is taken as the threshold
for considering a comment to be correct. Hypercritic failed to generate 55
of the 200 comments (28%) that were endorsed by five or more judges.
Conversely, Hypercritic did generate 24 of the 98 comments (24%) that were
rated incorrect because fewer than five judges endorsed them. Note that
these error rates depend on the investigator's choice of a threshold.
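
The two error rates can be recomputed directly from the cell counts in Table 5.6. The short Python sketch below does only that arithmetic. The variable names, including `miss_rate` and `false_comment_rate`, are illustrative labels and are not terms used in the study.

```python
# Cell counts from Table 5.6 (threshold: endorsement by five or more judges).
generated_valid = 145
generated_not_valid = 24
not_generated_valid = 55
not_generated_not_valid = 74

valid_total = generated_valid + not_generated_valid              # 200
not_valid_total = generated_not_valid + not_generated_not_valid  # 98

# Share of valid comments that Hypercritic failed to generate.
miss_rate = not_generated_valid / valid_total
# Share of not-valid comments that Hypercritic generated anyway.
false_comment_rate = generated_not_valid / not_valid_total

print(f"missed valid comments: {miss_rate:.1%}")
print(f"generated not-valid comments: {false_comment_rate:.1%}")
```

The unrounded values are 27.5% and 24.5%, which the text reports as 28% and 24%.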
Self-Test 5.5
Assume that the data in Table 5.5, based only on 12 comments, constitute
a complete pilot study. The reliability of these data, based on 12 comments
(objects) and eight judges (observations), is 0.29. (Note that this illustrates
the danger of conducting measurement studies with small samples of
objects, as the reliability estimated from this small sample is different
from that obtained with the full sample of 298 comments.) For this pilot
study:
1. What is the standard error of measurement of the “correctness of a
comment” as determined by these eight judges?
2. If there were four judges instead of eight, what would be the estimated
reliability of the measurement? What if there were 10 judges?
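
One way to approach question 2 is the Spearman-Brown prophecy formula, a standard psychometric result that predicts reliability when the number of observations changes by a factor k. The sketch below assumes that this formula is the intended tool and that the judges can be treated as interchangeable. The function name `spearman_brown` is illustrative.

```python
def spearman_brown(reliability, n_old, n_new):
    """Predict reliability after changing the number of judges.

    Spearman-Brown: rho_new = k * rho / (1 + (k - 1) * rho),
    where k = n_new / n_old.
    """
    k = n_new / n_old
    return k * reliability / (1 + (k - 1) * reliability)

pilot_reliability = 0.29  # reported for 12 comments and eight judges
for n_judges in (4, 10):
    estimate = spearman_brown(pilot_reliability, n_old=8, n_new=n_judges)
    print(f"{n_judges} judges: estimated reliability {estimate:.2f}")
```

Under these assumptions the estimates come out at about 0.17 for four judges and about 0.34 for 10 judges.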
 