are “actually implicit and arise from the culture
of practising statistics rather than being found in
books” (Cairns, 2007). This short position paper
does not directly contribute anything new to this
long-running debate, as several very eloquent
essays already exist within other domains; instead,
our aim is to introduce the mobile-HCI community
to this discussion and to raise awareness of some
key papers that discuss the limitations of p-based
null-hypothesis statistical testing.
The paper starts with an introduction to the
key problems raised in the long discussion in the
statistics and experimental psychology domains
and moves on to discuss key suggested alterna-
tives - throughout we will make reference to the
common use of statistics in mobile HCI work. We
feel these issues are relevant to all HCI work but
especially relevant to mobile-HCI. Mobiles are
used in noisy and complex environments in which
the user is often mobile. Experimental design
now, more often than not, reflects this complex
environment to some extent - this makes the
studies more complex but also introduces many
more potential confounding variables that might
bias or simply confuse our results. So the magic
formulae of p-testing and ANOVA give us “some
degree of reassurance that we are following good
scientific practices” (Drummond, 2008). But is
this reassurance misplaced or, worse, distorting
the investigative nature of science?
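To make this concern concrete before cataloguing the problems, the following is a small illustrative simulation of our own (not drawn from the cited literature): two conditions differ by a trivially small true effect of 0.05 standard deviations. With a small sample the difference typically fails the conventional p<0.05 test; with a very large sample the same tiny effect comfortably passes it, even though its practical importance is unchanged. The large-sample z-test used here is a standard approximation, chosen to keep the sketch dependency-free.

```python
import math
import random

random.seed(42)

def two_sided_p(sample_a, sample_b):
    """Two-sided p-value for a difference in means (large-sample z-test).

    Uses the identity p = erfc(|z| / sqrt(2)) for a standard-normal z.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    se = math.sqrt(var_a / n_a + var_b / n_b)
    z = (mean_a - mean_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Two conditions whose means differ by 0.05 standard deviations -- a
# trivially small effect size by any conventional standard.
small_a = [random.gauss(0.00, 1) for _ in range(20)]
small_b = [random.gauss(0.05, 1) for _ in range(20)]
large_a = [random.gauss(0.00, 1) for _ in range(100_000)]
large_b = [random.gauss(0.05, 1) for _ in range(100_000)]

print(two_sided_p(small_a, small_b))  # typically well above 0.05
print(two_sided_p(large_a, large_b))  # far below 0.05 -- same tiny effect
```

The point is not that the large-sample result is wrong, but that "p<0.05" alone says nothing about whether the effect matters: the verdict flips with sample size while the effect size stays constant.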
KEY PROBLEMS WITH
P-BASED STATISTICS
The debate on null-hypothesis testing has identi-
fied many “sins” of null-hypothesis significance
testing (NHST) and the way that it is normally
used in scientific work. Here we look at them
as we perceive the severity of the problem in
mobile-HCI:
1. Treating NHST as a binary approval of result
validity;
2. Confusing strength of p-value results with
effect size;
3. Abusing the statistical tests themselves;
4. Drawing conclusions from non-significant
results;
5. Making illogical arguments based on results.
Reviewing recent proceedings of MobileHCI,
we are not as guilty as other domains in which null-
hypothesis testing has been criticised. However,
we tend to be guilty of the first three sins quite
widely, and we perceive a risk that, as publication
becomes more competitive, reviewers might
push us further along the route of inappropriate
statistics.
1. One of the key problems with NHST that
has been identified in other domains is the
binary treatment of results. The focus on
pre-set levels of statistical significance,
usually p<0.05, leads to simplistic analysis
of results: if this level of significance is
reached, authors tend not to probe deeper
into the reasons, and reviewers tend to accept
the claims as valid. On the other hand, both
authors and reviewers are often much more
critical of papers whose results do not
reach this level of significance, sometimes
without probing deeper into the reasons.
However, there is nothing magical about
0.05; indeed, the fixed level was originally
introduced only for convenience, so that
back-of-the-book tables could be produced
in the days before computerised statistics
packages. In mobile-HCI, as in many
domains, we very rarely consider what level
of significance is required before an
experimental result is meaningful: do we
need 95% confidence in rejecting the null
hypothesis, or would 90% do, or do we
really need 99.97% for this kind of result?
Reviewing recent mobile-HCI papers, about
half of them do not report the actual p value,
confirming the binary treatment of this
value; our experiments have