• In experiments in which humans use a simulated system, will they know whether
they are interacting with a human or a computer?
• Does it need to be a substantial, controlled experiment, or are case studies
sufficient? With the former, quantifiable measures are gathered over a large number of
individual tests; with the latter, the goal is to undertake a qualitative analysis of a
few individuals.
As an example of this last point, a student of mine developed a tool for identifying
and prioritizing changes between versions of legal documents. He tested his system
quantitatively, by creating thirty or so pairs of documents (each pair consisting of
an original and a revision of it), and observing how many of 15 readers could find
the changes in each pair, either with the new tool or with an existing approach. He
also tested it qualitatively, by interviewing three pairs of authors who used it for
some weeks as they collaboratively authored three documents. This work produced
convincing results, with the two approaches independently confirming each other.[4]
Far too many human studies in computer science are amateurish and invalid.
Instructions to the experimental subjects should be clear; the sample of human
subjects should be representative (a class of computer science students may not be
typical of users of mobile devices); the subjects should be unaware of which of the
competing methods under review was proposed by the researcher; anonymity should
be preserved; and controls—analogous to placebos in medical trials—should be in
place. The ethical guidelines for human studies at most universities are far-reaching,
and in all likelihood any investigation involving people evaluating a system needs
ethics clearance.
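
The blinding requirement above can be made concrete with a minimal sketch. It assumes a study comparing two methods; the names `blind_assignments`, `new_tool`, `existing_approach`, and the subject identifiers are hypothetical and not taken from any study described here. Each method gets a neutral label so that subjects cannot tell which one the researcher proposed, and the presentation order is shuffled per subject.

```python
import random


def blind_assignments(methods, subject_ids, seed=0):
    """Return (key, schedule) for a blinded comparison.

    key      maps a neutral label ("System A", ...) to the real method name;
             it should be kept away from subjects until the study ends.
    schedule maps each subject to the order in which labels are presented.
    All names here are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the allocation can be audited

    # Shuffle before labelling so "System A" is not always the new method.
    shuffled = list(methods)
    rng.shuffle(shuffled)
    key = {f"System {chr(ord('A') + i)}": m for i, m in enumerate(shuffled)}

    labels = sorted(key)
    schedule = {}
    for subject in subject_ids:
        order = labels[:]
        rng.shuffle(order)  # vary presentation order across subjects
        schedule[subject] = order
    return key, schedule


key, schedule = blind_assignments(
    ["new_tool", "existing_approach"], ["s01", "s02", "s03"]
)
```

Subjects would see only the labels in `schedule`; `key` would be consulted after the responses are collected. This covers only the blinding step, not sampling, controls, or ethics clearance.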
There is no doubt, however, that human studies are an essential element of
computer science research. Without evaluation of the impact on users, for example, it is
difficult to see how to draw strong conclusions concerning a user interface, a search
mechanism, a machine translation system, a software engineering methodology, a
video compression technique, or any of a vast range of contributions. However,
human studies of some questions continue to be a rarity; research in a range of areas
is flawed by lack of measurement of the human element. The fact that such studies
would be expensive is a poor reason for avoiding them; such reasons would not be
acceptable in medicine or psychology.
One of the longest-running experiments in computer science is the TREC
evaluation of information retrieval systems at the U.S. National Institute of Standards
and Technology, which has a significant human-factors component. Each year,
participants—a large number of research groups from around the world—apply their
retrieval systems to standard, shared materials with unknown attributes. This side of
the experiment is blind; the researchers do not know which materials have which
attributes. The output of the systems is then manually evaluated by human assessors,
who label the materials to assign attributes to individual items. This side of the
experiment is also blind: the outputs are merged prior to inspection and the assessors do
not know which system has done what. Another aspect of the TREC work is that the
[4] Sadly, the findings were that the system was unhelpful.