composing a scale, roughly equal numbers phrased positively and negatively. For example, the set might include both of the following:
My ability to be productive in my job was enhanced by the new computer system.
Strongly agree    Agree    Neither agree nor disagree    Disagree    Strongly disagree

The new system slowed the rate at which I could complete routine tasks.
Strongly agree    Agree    Neither agree nor disagree    Disagree    Strongly disagree
In this example, pairing an item that a respondent who feels positively about the system would endorse with one that the same respondent would reject forces the respondent to attend more closely to the content of the items themselves.
This strategy increases the chance that the respondent will evaluate each
item on its own terms, rather than responding to a global impression. When
analyzing the responses to such item sets, the negatively phrased items
should be reverse coded before each respondent's results are averaged, so
that calculations to estimate reliability give correct results.
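As a concrete sketch of this step, the following Python fragment reverse codes the negatively phrased items, averages each respondent's results, and then estimates reliability with Cronbach's alpha, one common reliability coefficient. The data, the positions of the negative items, and the five-point 1-to-5 coding are hypothetical assumptions for illustration, not taken from the text:

```python
import numpy as np

SCALE_MAX = 5  # five-point scale, coded 1 (Strongly disagree) .. 5 (Strongly agree)

# Rows = respondents, columns = items; columns 1 and 3 stand in for
# negatively phrased items like "The new system slowed the rate ..." above.
responses = np.array([
    [5, 1, 4, 2],
    [4, 2, 5, 1],
    [2, 4, 2, 5],
    [4, 1, 4, 2],
])
negative_items = [1, 3]

# Reverse code: on a 1..5 scale a response r becomes (5 + 1) - r, so
# "Strongly agree" with a negative item scores like "Strongly disagree".
coded = responses.copy()
coded[:, negative_items] = (SCALE_MAX + 1) - coded[:, negative_items]

# Each respondent's result is the mean of the reverse-coded items.
scores = coded.mean(axis=1)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals).
k = coded.shape[1]
alpha = (k / (k - 1)) * (1 - coded.var(axis=0, ddof=1).sum()
                         / coded.sum(axis=1).var(ddof=1))
print(scores, round(alpha, 2))
```

Skipping the reverse-coding step would leave the negative items negatively correlated with the rest of the scale and depress the reliability estimate, which is why the recoding must precede the averaging.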
A second strategy, useful in situations where one instrument is being used
to assess multiple attributes, is to intermix items that measure different
attributes. This practice is common on psychological instruments: concealing which attributes are being measured encourages respondents to answer more honestly and spontaneously. It may not, however, be an advisable strategy
for an instrument used by judges to rate performance. In this case the rating
form should be organized to make the rating process as easy as possible,
and items addressing the same attribute should be clustered together. If a
form is being used to rate some behavior occurring in real time—for
example, the performance by a technician of a lab procedure—it is partic-
ularly important that the form be arrayed as logically as possible so respon-
dents do not have to search for the items they wish to complete.
The Ratings Paradox
There are profound trade-offs involved in making the items on a rating
form more specific. A major part of the art of measurement using ratings
is to identify the right level of specificity or granularity. The greater the
specificity of the items, the less judgment the raters exercise when offering
their opinions, and this will usually generate higher reliability of measure-
ment. However, rating forms that are highly specific in the interest of
generating interrater consistency can become almost mechanical. In the
extreme, raters are merely observing the occurrence of atomic events (“The
end user entered a search term that was spelled correctly”), and their expertise as judges is not being exercised at all.
As attributes rated by individual items become less specific and more global, agreement among raters is more difficult to achieve; as they become more specific, agreement comes more easily but the measurement draws less on the raters' expert judgment.
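"Agreement among raters" is usually quantified with a chance-corrected statistic. The sketch below, using hypothetical pass/fail ratings from two raters, computes Cohen's kappa, one standard such measure; the ratings and categories are invented for illustration:

```python
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "fail"]

n = len(rater_a)
# Observed agreement: proportion of items on which the raters match.
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Expected chance agreement, from each rater's marginal frequencies.
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
categories = set(rater_a) | set(rater_b)
p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

# Kappa rescales observed agreement after removing chance agreement.
kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"observed={p_observed:.2f} expected={p_expected:.2f} kappa={kappa:.2f}")
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 means the raters agree no more often than their marginal rating frequencies alone would predict.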