are “actually implicit and arise from the culture
of practising statistics rather than being found in
books” (Cairns, 2007). This short position paper
does not directly contribute anything new to this
long-running debate, as several very eloquent
essays already exist within other domains; instead,
our aim is to introduce the mobile-HCI community
to this discussion and to raise awareness of some
key papers that discuss the limitations of p-based
null-hypothesis statistical testing.
The paper starts with an introduction to the
key problems raised in the long discussion in the
statistics and experimental psychology domains
and moves on to discuss key suggested alterna-
tives - throughout we will make reference to the
common use of statistics in mobile HCI work. We
feel these issues are relevant to all HCI work but
especially relevant to mobile-HCI. Mobiles are
used in noisy and complex environments in which
the user is often mobile. Experimental design
now, more often than not, reflects this complex
environment to some extent - this makes the
studies more complex but also introduces many
more potential confounding variables that might
bias or simply confuse our results. So the magic
formulae of p-testing and ANOVA give us “some
degree of reassurance that we are following good
scientific practices” (Drummond, 2008). But is
this reassurance misplaced or, worse, distorting
the investigative nature of science?
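To make this concern concrete before cataloguing the problems, the following is a small illustrative simulation of our own (not drawn from the cited literature): two conditions differ by a trivially small true effect of 0.05 standard deviations. With a small sample the difference typically fails the conventional p<0.05 test; with a very large sample the same tiny effect comfortably passes it, even though its practical importance is unchanged. The large-sample z-test used here is a standard approximation, chosen to keep the sketch dependency-free.

```python
import math
import random

random.seed(42)

def two_sided_p(sample_a, sample_b):
    """Two-sided p-value for a difference in means (large-sample z-test).

    Uses the identity p = erfc(|z| / sqrt(2)) for a standard-normal z.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    se = math.sqrt(var_a / n_a + var_b / n_b)
    z = (mean_a - mean_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Two conditions whose means differ by 0.05 standard deviations -- a
# trivially small effect size by any conventional standard.
small_a = [random.gauss(0.00, 1) for _ in range(20)]
small_b = [random.gauss(0.05, 1) for _ in range(20)]
large_a = [random.gauss(0.00, 1) for _ in range(100_000)]
large_b = [random.gauss(0.05, 1) for _ in range(100_000)]

print(two_sided_p(small_a, small_b))  # typically well above 0.05
print(two_sided_p(large_a, large_b))  # far below 0.05 -- same tiny effect
```

The point is not that the large-sample result is wrong, but that "p<0.05" alone says nothing about whether the effect matters: the verdict flips with sample size while the effect size stays constant.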
KEY PROBLEMS WITH
P-BASED STATISTICS
The debate on null-hypothesis testing has identi-
fied many “sins” of null-hypothesis significance
testing (NHST) and the way that it is normally
used in scientific work. Here we look at them
as we perceive the severity of the problem in
mobile-HCI:
1. Treating NHST as a binary approval of result
validity;
2. Confusing strength of p-value results with
effect size;
3. Abusing the statistical tests themselves;
4. Drawing conclusions from non-significant
results;
5. Making illogical arguments based on results.
Reviewing recent proceedings of MobileHCI,
we are not as guilty as other domains in which null-
hypothesis testing has been criticised. However,
we tend to be guilty of the first three sins quite
widely, and we perceive a risk that, as publication
becomes more competitive, reviewers might
push us further along the route of inappropriate
statistics.
1. One of the key problems with NHST that
has been identified in other domains is the
binary treatment of results. The focus on
pre-set levels of statistical significance,
usually p<0.05, leads to simplistic analysis
of results: if this level of significance is
reached, authors tend not to probe deeper
into the reasons, and reviewers tend to accept
the claims as valid. On the other hand, both
authors and reviewers are often much more
critical of papers whose results do not
reach this level of significance, sometimes
without probing deeper into the reasons.
However, there is nothing magical about
0.05; indeed, the fixed level was originally
introduced only for convenience, so that
back-of-the-book tables could be produced
in the days before computerised statistics
packages. In mobile-HCI, as in many
domains, we very rarely consider what level
of significance is required before an
experimental result is meaningful: do we
need 95% confidence in rejecting the null
hypothesis, or would 90% do, or do we
really need 99.97% for this kind of result?
Reviewing recent mobile-HCI papers, about
half of them do not report the actual p value,
confirming the binary treatment of this
value; our experiments have