KEY SUGGESTED SOLUTIONS
If there is a single lesson from the discussion of null-hypothesis testing in other domains, it is that the size of the effect should be reported in some way, usually alongside the p-value results. Effect size tells us how big the observed differences were, while p-values indicate how much confidence we should attribute to the basic result. There are two ways of presenting effect size: graphing the results, which to a large extent is normal practice in (mobile-)HCI but could still be standardised somewhat, and using measures of effect size, which are rare in mobile-HCI papers (but also probably less informative than graphs). See (Denis, 2003) for a discussion of this point and an extensive and balanced review of alternatives to null-hypothesis testing.

Graphing results is standard procedure in HCI papers and typically shows much more information than bare p-values (Loftus, 1993; Wilkinson, 1999): good graphs show trends over time/practice and the size of the difference as well as the range of results. This is good practice and an area in which the HCI community deserves praise over other domains. However, we are not perfect, and the display of error bars on graphs is not as consistent as it should be: sometimes they are absent, sometimes they report a standard deviation, sometimes a standard error or 95% confidence interval, and sometimes the absolute range. By graphing suitable confidence intervals and stating the confidence level of the estimate, alongside point estimates of the population parameter(s), we illustrate visually both the differences between groups and the reliability of the estimates made (i.e. the experimental mean for system A is x and we are 95% confident that the true mean lies between x - d1 and x + d2). As well as reflecting the range of values, confidence intervals also provide an indication of the sample size, as larger samples will tend to produce tighter intervals. Figure 1 shows three graphs of the same data: an artificial experiment comparing two systems over six experimental tasks. The first graph shows the simplest plot of only the means; this plot gives the impression that one system is better at the beginning but that performance swaps over around task 4. The second plot adds error bars showing the 95% confidence range and makes clear that the data overlap massively at the beginning and are only likely to be conclusive at the right-hand side of the graph. Finally, the third plot replaces the error bars with scatter bars of the actual data, highlighting the inconclusive nature of tasks 1 through 5 and that even in task 6 we do not have perfect separation between the two systems.
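As a concrete illustration of the kind of plot the second graph in Figure 1 represents, the sketch below computes per-task means and 95% confidence intervals (via the t-distribution) for two systems and draws them with error bars. It is a minimal sketch, not the code behind Figure 1: the simulated data, the system labels, and the choice of NumPy/SciPy/matplotlib are all assumptions made purely for illustration.

# Minimal sketch: per-task means with 95% confidence-interval error bars
# for two hypothetical systems measured over six tasks (data are made up).
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_participants, n_tasks = 12, 6
scores_a = rng.normal(loc=10, scale=2, size=(n_participants, n_tasks))
scores_b = rng.normal(loc=np.linspace(10, 13, n_tasks), scale=2,
                      size=(n_participants, n_tasks))

def mean_and_ci(samples, confidence=0.95):
    """Return per-task means and the half-width d of the confidence interval."""
    means = samples.mean(axis=0)
    sems = stats.sem(samples, axis=0)            # standard error of each mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=samples.shape[0] - 1)
    return means, t_crit * sems                  # CI is mean - d to mean + d

tasks = np.arange(1, n_tasks + 1)
for label, scores in [("System A", scores_a), ("System B", scores_b)]:
    means, half_widths = mean_and_ci(scores)
    plt.errorbar(tasks, means, yerr=half_widths, capsize=4, label=label)

plt.xlabel("Task")
plt.ylabel("Performance")
plt.legend()
plt.show()

Stating the confidence level in the caption (here 95%) is what lets the reader interpret the bars; the same code with stats.sem alone would silently plot the narrower standard-error bars instead.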
Alongside the display of confidence intervals it would be desirable to report the effect size: a scaled estimate of the difference between groups. Reporting the effect size allows the practical importance of a result to be judged, something that cannot be conveyed through statistical significance alone. Encouraging both confidence intervals and effect sizes to be reported enables the reader/reviewer to evaluate the results of an experiment more effectively than a p-value alone, regardless of whether statistical significance was achieved. Also, reporting a standardised effect size opens up the potential for future meta-analysis of related studies through the use of pooled samples.
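To make that distinction concrete, the following sketch contrasts a p-value with one common standardised effect size, Cohen's d (the chapter does not prescribe a particular measure, so this choice is an assumption, as are the simulated sample sizes and means). With a large enough sample, a practically trivial difference can still reach statistical significance, which is precisely the information the effect size adds.

# Minimal sketch: p-value versus a standardised effect size (Cohen's d).
# Simulated data: a large sample makes a tiny difference "significant"
# even though the standardised effect remains small.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=2000)
group_b = rng.normal(loc=10.2, scale=2.0, size=2000)   # tiny true difference

t_stat, p_value = stats.ttest_ind(group_a, group_b)

def cohens_d(x, y):
    """Difference of means scaled by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

print(f"p = {p_value:.4f}")                              # likely well below 0.05
print(f"Cohen's d = {cohens_d(group_a, group_b):.2f}")   # around 0.1: a small effect

Because d is expressed in pooled standard-deviation units, values from separate but related studies can later be combined, which is what makes the meta-analysis mentioned above possible.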
Another criticism of HCI is the lack of replication: other domains base their science on publishing results that others then replicate, to further understand and to confirm (or refute) the original. Ioannidis motivates his criticism by highlighting that the "high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance..." (Ioannidis, 2005). In a domain that does not attempt, nor support publication of, replicated results, we do not know how bad our non-replication problem is.