Developing and Improving Measurement Methods - Evaluation Methods in Biomedical Informatics

to generalize from that large number of cases without having to expose

every object to every case. In measurement studies, generalizability theory

(see Appendix A) provides a way to estimate sources of measurement

errors for nested designs.

Judges Facet

The judges facet enters into a measurement problem whenever informed

human judges assess specific aspects of the quality of an activity or a

product they are observing. Judges become central to measurement in

informatics for situations where there are no reference standards or correct

answers for the attribute(s) under study. In these situations, the considered

opinions of human experts are the best option to generate a measured

score. A study might employ experts to judge the quality of the interactions

between patients and clinicians, as the clinicians enter patient data into an

information resource during the interaction. In another example, observers

may assess key aspects of the interaction of end users with a new information

resource during a beta test. As with any measurement process, the

primary concern is the correlation among the independent observations

(in this situation, the judges) and the resulting number of judges required

to obtain a reliable measurement. A set of “well-behaved” judges, all of

whom correlate with one another to an acceptable extent when rating a

representative sample of objects, can be said to form a scale. A large

literature on performance assessment by judges speaks in more detail to

many of the issues addressed here [18-20].
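
The link between interjudge correlation and the number of judges required can be illustrated with the Spearman-Brown formula, a standard psychometric result that this passage does not itself name. The Python below is a minimal sketch, assuming that every judge rates the same objects and that the judges are roughly interchangeable; the function names, the `max_judges` cap, and any target reliability passed in are illustrative assumptions, not part of the source.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_interjudge_correlation(ratings):
    """Average Pearson correlation over all pairs of judges.

    ratings: dict mapping a (hypothetical) judge label to that judge's
    scores, with the objects listed in the same order for every judge.
    """
    pairs = list(combinations(ratings.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def reliability_of_mean(r_bar, k):
    """Spearman-Brown: reliability of the average rating of k judges."""
    return k * r_bar / (1 + (k - 1) * r_bar)


def judges_needed(r_bar, target, max_judges=1000):
    """Smallest number of judges whose averaged rating reaches target."""
    if not (0 < r_bar <= 1 and 0 < target < 1):
        raise ValueError("r_bar must be in (0, 1] and target in (0, 1)")
    for k in range(1, max_judges + 1):
        if reliability_of_mean(r_bar, k) >= target:
            return k
    raise ValueError("target not reached within max_judges")
```

Under these assumptions, a mean interjudge correlation of 0.3 would call for about ten judges before the averaged rating reaches a reliability of 0.80, whereas a correlation of 0.5 would call for about four. The sketch treats judges as a single facet; the generalizability theory approach mentioned earlier is the more general tool when several facets are involved.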

Sources of Variation Among Judges

Ideally, all judges of the same object, using the same criteria and forms to

record their opinions, should render highly correlated judgments. All variation

should then be among objects. Many factors that erode interjudge

agreement are well known and have been well documented [21]:

1. Interpretation or logical effects: Judges may differ in their interpretations

of the attribute(s) to be rated and the meanings of the items on the

forms on which they record their judgments. They may give similar ratings

to attributes that are logically related in their own minds.

2. Judge tendency effects: Some judges are consistently overgenerous or

lenient; others are consistently hypercritical or stringent. Still others do not

employ the full set of response options on a form, locating all of their ratings

in a narrow region, which is usually at the middle of the range. This phenomenon

is known as a “central tendency” effect.

3. Insufficient exposure: Sometimes the logistics of a study require that

judges base their judgments on less exposure to the objects than is necessary

to come to an informed conclusion. This may occur, for example, if

investigators schedule 10 minutes of observation of end users working with
