KEY SUGGESTED SOLUTIONS
If there is a single lesson from the discussion of null-hypothesis testing in other domains, it is that the size of the effect should be reported in some way, usually alongside the p-value results. Effect size tells us how big the observed differences were, while p-values indicate how much confidence we should attribute to the basic result. There are two ways of presenting effect size: graphing the results, which to a large extent is normal practice in (mobile-)HCI but could still be standardised somewhat, and using measures of effect size, which are rare in mobile-HCI papers (but also probably less informative than graphs). See (Denis, 2003) for a discussion of this point and an extensive and balanced review of alternatives to null-hypothesis testing.

Graphing results is standard procedure in HCI papers and typically shows much more information than bare p-values (Loftus, 1993; Wilkinson, 1999): good graphs show trends over time/practice and the size of the difference as well as the range of results. This is good practice and an area in which the HCI community deserves praise over other domains. However, we are not perfect, and the display of error bars on graphs is not as consistent as it should be: sometimes they are absent, sometimes they report a standard deviation, sometimes a standard error or 95% confidence interval, and sometimes the absolute range. By graphing suitable confidence intervals and stating the confidence level of the estimate, alongside point estimates of the population parameter(s), we illustrate visually both the differences between groups and the reliability of the estimates made (i.e. the experimental mean for system A is x and we are 95% confident that the true mean lies between x - d1 and x + d2). As well as reflecting the range of values, confidence intervals also provide an indication of the sample size, as larger samples will tend to produce tighter intervals. Figure 1 shows three graphs of the same data: an artificial experiment comparing two systems over six experimental tasks. The first graph shows the simplest plot of only the means; this plot gives the impression that one system is better at the beginning but that performance swaps over around task 4. The second plot adds error bars showing the 95% confidence range and makes clear that the data overlap massively at the beginning and are only likely to be conclusive at the right-hand side of the graph. Finally, the third plot replaces the error bars with scatter bars of the actual data, highlighting the inconclusive nature of tasks 1 through 5 and that even in task 6 we do not have perfect separation between the two systems.
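As a concrete illustration of the kind of plot the second graph in Figure 1 represents, the sketch below computes per-task means and 95% confidence intervals (via the t-distribution) for two systems and draws them with error bars. It is a minimal sketch, not the code behind Figure 1: the simulated data, the system labels, and the choice of NumPy/SciPy/matplotlib are all assumptions made purely for illustration.

# Minimal sketch: per-task means with 95% confidence-interval error bars
# for two hypothetical systems measured over six tasks (data are made up).
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_participants, n_tasks = 12, 6
scores_a = rng.normal(loc=10, scale=2, size=(n_participants, n_tasks))
scores_b = rng.normal(loc=np.linspace(10, 13, n_tasks), scale=2,
                      size=(n_participants, n_tasks))

def mean_and_ci(samples, confidence=0.95):
    """Return per-task means and the half-width d of the confidence interval."""
    means = samples.mean(axis=0)
    sems = stats.sem(samples, axis=0)            # standard error of each mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=samples.shape[0] - 1)
    return means, t_crit * sems                  # CI is mean - d to mean + d

tasks = np.arange(1, n_tasks + 1)
for label, scores in [("System A", scores_a), ("System B", scores_b)]:
    means, half_widths = mean_and_ci(scores)
    plt.errorbar(tasks, means, yerr=half_widths, capsize=4, label=label)

plt.xlabel("Task")
plt.ylabel("Performance")
plt.legend()
plt.show()

Stating the confidence level in the caption (here 95%) is what lets the reader interpret the bars; the same code with stats.sem alone would silently plot the narrower standard-error bars instead.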
Alongside the display of confidence intervals it would be desirable to report the effect size: a scaled estimate of the difference between groups. Reporting the effect size allows the practical importance of a result to be judged, something that cannot be conveyed through statistical significance alone. Encouraging both confidence intervals and effect sizes to be reported enables the reader/reviewer to evaluate the results of an experiment more effectively than a p-value alone, regardless of whether statistical significance was achieved. Also, reporting a standardised effect size opens up the potential for future meta-analysis of related studies through the use of pooled samples.
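To make that distinction concrete, the following sketch contrasts a p-value with one common standardised effect size, Cohen's d (the chapter does not prescribe a particular measure, so this choice is an assumption, as are the simulated sample sizes and means). With a large enough sample, a practically trivial difference can still reach statistical significance, which is precisely the information the effect size adds.

# Minimal sketch: p-value versus a standardised effect size (Cohen's d).
# Simulated data: a large sample makes a tiny difference "significant"
# even though the standardised effect remains small.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=2000)
group_b = rng.normal(loc=10.2, scale=2.0, size=2000)   # tiny true difference

t_stat, p_value = stats.ttest_ind(group_a, group_b)

def cohens_d(x, y):
    """Difference of means scaled by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

print(f"p = {p_value:.4f}")                              # likely well below 0.05
print(f"Cohen's d = {cohens_d(group_a, group_b):.2f}")   # around 0.1: a small effect

Because d is expressed in pooled standard-deviation units, values from separate but related studies can later be combined, which is what makes the meta-analysis mentioned above possible.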
Another criticism of HCI is the lack of replication: other domains base their science on publishing results that others then replicate, to further understand and to confirm (or refute) the original. Ioannidis motivates his criticism by highlighting that the "high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance..." (Ioannidis, 2005). In a domain that does not attempt, nor support publication of, replicated results, we do not know how bad our non-replication problem is.