either achieved this magical number and thus
our results are important or they have not
and are thus no better than random: clearly
a gross simplification.
2. Most statistics texts, and most people who use statistics for experimental analysis, make it quite clear that you are more likely to get a statistically significant result with more people. What is less well understood is that, if there is a real difference to be found, the value of p is inversely related to the number of subjects - the more people you study, the smaller the p value will become. What is not strictly related to the p value is the size of the effect. A study with a large number of users will most likely find a statistically significant effect, but that does not mean the effect is meaningful, large or scientifically significant - it may be a trivial difference that would never be noticed in real use, never mind have a commercial benefit (the first sketch after this list illustrates this). However, the smaller the sample, the less likely it is to be representative of the real population and, thus, “true” (Ioannidis, 2005). Not reporting effect size in some form becomes especially dangerous when linked with our binary thinking about probability.
3. Most statistical procedures (including the standard t-tests and ANOVA) make strong assumptions about the underlying data and are invalid if these assumptions are not met. In particular, they assume the data are drawn from an underlying population that is normally distributed. In many psychological tests, e.g. reaction time, it is assumed that the whole population will follow a normal distribution - this is not true for many mobile tasks and experiments. For example, in text entry there is a very wide range of abilities and it is hard to assess the underlying population spread - there are many people with high performance but also a long and important tail. There are techniques to overcome this problem (either using non-parametric tests or transforming the data, say by taking log values of times - the second sketch after this list shows both), but discussion of checking parametric assumptions rarely appears in experimental papers. Furthermore, the distribution in “the population” also differs greatly depending on what the underlying population is expected to be - and we rarely report what underlying population we are studying: again for text entry, is it all mobile users, regular 12-key users, teenagers, twin-thumbers,...? Cairns discusses these and other statistical problems in a review of the use of statistics in British HCI Conference papers (Cairns, 2007).
4. While other domains are more guilty of this
than HCI, there is still sometimes a tendency
to want to spin a non-significant result into
a significant non-result. This spin negates the whole point of null-hypothesis testing: the authors are trying to use NHST to argue exactly what it is meant to prevent. When we use NHST we are trying to say “the chances of this happening randomly are very low, so we have a meaningful difference”; the negation is “the chances of this happening randomly are not very low, so we have no clear result” and not “the chances of this happening by chance are high, therefore there is no real difference”.
5. NHST tells us the probability that the observed data would occur by chance given that the null hypothesis is true - usually, the probability that we would observe these data given that there is no difference in the performance of two systems on a certain measure. This is not the same as the probability that the systems are the same, nor is 1-p the probability that there is a difference. This is a fairly complex argument involving Bayesian probability and the validity of modus tollens; we direct the reader to Cohen (Cohen, 1994) for discussion and examples, and the final sketch after this list gives a simple numerical illustration.
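
To make point 2 concrete, the following short Python sketch (illustrative only; the group sizes, the 30 wpm baseline and the 0.3 wpm difference are assumptions, not data from the chapter) shows how a trivial difference in text-entry speed can yield a very small p value once the sample is large enough, while the effect size (Cohen's d) remains negligible.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 5000  # hypothetical large sample per group
group_a = rng.normal(loc=30.0, scale=5.0, size=n)  # words per minute
group_b = rng.normal(loc=30.3, scale=5.0, size=n)  # a tiny 0.3 wpm "improvement"

t, p = stats.ttest_ind(group_a, group_b)

# Cohen's d: difference in means divided by the pooled standard deviation.
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p = {p:.4f}")  # usually well below .05 with samples this large
print(f"d = {d:.3f}")  # around 0.06: a trivial effect in practical terms

Reporting d (or some other measure of effect size) alongside p is what exposes the triviality of the difference; the p value alone does not.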
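For point 3, this sketch (again illustrative, with assumed lognormal task times standing in for skewed text-entry data) shows one way to check the normality assumption and the two remedies mentioned above: transforming the data or switching to a non-parametric test.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Skewed task times (seconds): most people are quick, with a long tail
# of slower performers, much like real text-entry data.
times_a = rng.lognormal(mean=2.5, sigma=0.6, size=30)
times_b = rng.lognormal(mean=2.3, sigma=0.6, size=30)

# Shapiro-Wilk test of normality on the raw times; a small p value
# suggests the normality assumption does not hold.
w, p_norm = stats.shapiro(times_a)
print(f"Shapiro-Wilk on raw times: p = {p_norm:.3f}")

# Option 1: transform the data - log times are closer to normal here.
t, p_t = stats.ttest_ind(np.log(times_a), np.log(times_b))
print(f"t-test on log times: p = {p_t:.3f}")

# Option 2: use a non-parametric test on the raw times.
u, p_u = stats.mannwhitneyu(times_a, times_b, alternative="two-sided")
print(f"Mann-Whitney U on raw times: p = {p_u:.3f}")

Either route is defensible; what matters is that the check is made and reported.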
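Finally, for point 5, a small Bayesian calculation (the prior and power values below are assumptions chosen purely for illustration, in the spirit of Cohen, 1994, and Ioannidis, 2005) makes the distinction explicit: p is P(data | H0), not P(H0 | data), and the probability that the null is true after a “significant” result depends heavily on the prior and on the power of the study.

# P(H0 | significant) via Bayes' rule, under assumed prior and power.
prior_h0 = 0.5  # assumed prior probability that the null is true
alpha = 0.05    # significance threshold: P(significant | H0 true)
power = 0.5     # assumed P(significant | H0 false)

p_sig = alpha * prior_h0 + power * (1 - prior_h0)
posterior_h0 = (alpha * prior_h0) / p_sig
print(f"P(H0 | significant) = {posterior_h0:.2f}")  # about 0.09, not 0.05

# With a more sceptical prior the gap widens further.
prior_h0 = 0.9
p_sig = alpha * prior_h0 + power * (1 - prior_h0)
print(f"P(H0 | significant) = {alpha * prior_h0 / p_sig:.2f}")  # about 0.47

A significant result can therefore leave the null hypothesis quite plausible when the prior odds against a real effect are high.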