composing a scale, roughly equal numbers phrased positively and negatively. For example, the set might include both of the following:
My ability to be productive in my job was enhanced by the new computer system.
Strongly agree    Agree    Neither agree nor disagree    Disagree    Strongly disagree

The new system slowed the rate at which I could complete routine tasks.
Strongly agree    Agree    Neither agree nor disagree    Disagree    Strongly disagree
In this example, pairing an item that a respondent who feels positively about the system would endorse with one that the same respondent would reject forces the respondent to attend more closely to the content of the items themselves.
This strategy increases the chance that the respondent will evaluate each
item on its own terms, rather than responding to a global impression. When
analyzing the responses to such item sets, the negatively phrased items
should be reverse coded before each respondent's results are averaged, so
that calculations to estimate reliability give correct results.
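As a concrete sketch of this step, the following Python fragment reverse codes the negatively phrased items, averages each respondent's results, and then estimates reliability with Cronbach's alpha, one common reliability coefficient. The data, the positions of the negative items, and the five-point 1-to-5 coding are hypothetical assumptions for illustration, not taken from the text:

```python
import numpy as np

SCALE_MAX = 5  # five-point scale, coded 1 (Strongly disagree) .. 5 (Strongly agree)

# Rows = respondents, columns = items; columns 1 and 3 stand in for
# negatively phrased items like "The new system slowed the rate ..." above.
responses = np.array([
    [5, 1, 4, 2],
    [4, 2, 5, 1],
    [2, 4, 2, 5],
    [4, 1, 4, 2],
])
negative_items = [1, 3]

# Reverse code: on a 1..5 scale a response r becomes (5 + 1) - r, so
# "Strongly agree" with a negative item scores like "Strongly disagree".
coded = responses.copy()
coded[:, negative_items] = (SCALE_MAX + 1) - coded[:, negative_items]

# Each respondent's result is the mean of the reverse-coded items.
scores = coded.mean(axis=1)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals).
k = coded.shape[1]
alpha = (k / (k - 1)) * (1 - coded.var(axis=0, ddof=1).sum()
                         / coded.sum(axis=1).var(ddof=1))
print(scores, round(alpha, 2))
```

Skipping the reverse-coding step would leave the negative items negatively correlated with the rest of the scale and depress the reliability estimate, which is why the recoding must precede the averaging.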
A second strategy, useful in situations where one instrument is being used
to assess multiple attributes, is to intermix items that measure different
attributes. This practice is common on psychological instruments: concealing which attributes are being measured encourages respondents to answer more honestly and spontaneously. It may not, however, be an advisable strategy
for an instrument used by judges to rate performance. In this case the rating
form should be organized to make the rating process as easy as possible,
and items addressing the same attribute should be clustered together. If a
form is being used to rate some behavior occurring in real time—for
example, the performance by a technician of a lab procedure—it is partic-
ularly important that the form be arrayed as logically as possible so respon-
dents do not have to search for the items they wish to complete.
The Ratings Paradox
There are profound trade-offs involved in making the items on a rating
form more specific. A major part of the art of measurement using ratings
is to identify the right level of specificity or granularity. The greater the
specificity of the items, the less judgment the raters exercise when offering
their opinions, and this will usually generate higher reliability of measure-
ment. However, rating forms that are highly specific in the interest of
generating interrater consistency can become almost mechanical. In the
extreme, raters are merely observing the occurrence of atomic events (“The
end user entered a search term that was spelled correctly”), and their expertise as judges is not being exercised at all.
As attributes rated by individual items become less specific and more global, agreement among raters is more difficult to achieve; as they become more specific, agreement comes more easily but the measurement draws less on the raters' expert judgment.
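"Agreement among raters" is usually quantified with a chance-corrected statistic. The sketch below, using hypothetical pass/fail ratings from two raters, computes Cohen's kappa, one standard such measure; the ratings and categories are invented for illustration:

```python
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "fail"]

n = len(rater_a)
# Observed agreement: proportion of items on which the raters match.
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Expected chance agreement, from each rater's marginal frequencies.
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
categories = set(rater_a) | set(rater_b)
p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

# Kappa rescales observed agreement after removing chance agreement.
kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"observed={p_observed:.2f} expected={p_expected:.2f} kappa={kappa:.2f}")
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 means the raters agree no more often than their marginal rating frequencies alone would predict.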