Experimentation - Writing for Computer Science


In experiments in which humans use a simulated system, will they know whether they are interacting with a human or a computer?

Does it need to be a substantial, controlled experiment, or are case studies sufficient? With the former, quantifiable measures are gathered over a large number of individual tests; with the latter, the goal is to undertake a qualitative analysis of a few individuals.

As an example of this last point, a student of mine developed a tool for identifying and prioritizing changes between versions of legal documents. He tested his system quantitatively, by creating thirty or so pairs of documents (each pair consisting of an original and a revision of it), and observing how many of 15 readers could find the changes in each pair, either with the new tool or with an existing approach. He also tested it qualitatively, by interviewing three pairs of authors who used it for some weeks as they collaboratively authored three documents. This work produced convincing results, with the two approaches independently confirming each other. 4

Far too many human studies in computer science are amateurish and invalid. Instructions to the experimental subjects should be clear; the sample of human subjects should be representative (a class of computer science students may not be typical of users of mobile devices); the subjects should be unaware of which of the competing methods under review was proposed by the researcher; anonymity should be preserved; and controls, analogous to placebos in medical trials, should be in place. The ethical guidelines for human studies at most universities are far-reaching, and in all likelihood any investigation involving people evaluating a system needs ethics clearance.
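The blinding and control requirements above can be made concrete with a small script. The following is only a minimal sketch, not a procedure taken from the text: the method names ("proposed", "existing", "control"), the neutral "System A/B/C" labels, and the function names are all hypothetical. It hides which method is the researcher's behind a shuffled label, and rotates the presentation order so that each method appears in each position equally often when the number of subjects is a multiple of the number of methods.

```python
import random


def blind_labels(methods, rng):
    """Map each real method name to a neutral label such as 'System A'."""
    shuffled = list(methods)
    rng.shuffle(shuffled)
    return {method: f"System {chr(ord('A') + i)}" for i, method in enumerate(shuffled)}


def blinded_schedule(subject_ids, methods, seed=0):
    """Return (label key, per-subject presentation order using labels only).

    The label key must be kept away from the subjects. The order given to
    each subject is a rotation of the method list, so positions are balanced
    across subjects; subjects are shuffled first so that the rotation a
    person receives does not depend on sign-up order.
    """
    rng = random.Random(seed)  # fixed seed so the plan can be reproduced
    methods = list(methods)
    labels = blind_labels(methods, rng)
    subjects = list(subject_ids)
    rng.shuffle(subjects)
    schedule = {}
    for i, subject in enumerate(subjects):
        shift = i % len(methods)
        order = methods[shift:] + methods[:shift]
        schedule[subject] = [labels[m] for m in order]
    return labels, schedule


if __name__ == "__main__":
    # Hypothetical study: six subjects, the new method, a baseline, a control.
    key, plan = blinded_schedule(range(1, 7), ["proposed", "existing", "control"])
    for subject in sorted(plan):
        print(subject, plan[subject])
```

Only the schedule would be shown to subjects; the label key stays with someone who does not run the sessions until the data have been collected.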

There is no doubt, however, that human studies are an essential element of computer science research. Without evaluation of the impact on users, for example, it is difficult to see how to draw strong conclusions concerning a user interface, a search mechanism, a machine translation system, a software engineering methodology, a video compression technique, or any of a vast range of contributions. However, human studies of some questions continue to be a rarity; research in a range of areas is flawed by lack of measurement of the human element. The fact that such studies would be expensive is a poor reason for avoiding them; such reasons would not be acceptable in medicine or psychology.

One of the longest-running experiments in computer science is the TREC evaluation of information retrieval systems at the U.S. National Institute of Standards and Technology, which has a significant human-factors component. Each year, participants (a large number of research groups from around the world) apply their retrieval systems to standard, shared materials with unknown attributes. This side of the experiment is blind; the researchers do not know which materials have which attributes. The output of the systems is then manually evaluated by human assessors, who label the materials to assign attributes to individual items. This side of the experiment is also blind: the outputs are merged prior to inspection and the assessors do not know which system has done what. Another aspect of the TREC work is that the

4 Sadly, the findings were that the system was unhelpful.
