SIDEBAR: SAMPLE SIZE: HOW MANY PARTICIPANTS DO YOU NEED FOR A USABILITY TEST?—cont'd
We usually try to have a sample size of at least 8 for our usability studies. Here's why:
1. Although you usually do see the same problems start to repeat after the first couple of participants, there are times when your first couple of participants are outliers. That is, they either zoom through all the tasks without finding a problem or they have difficulty with all of the tasks. In the scenario where your first 2 or 3 participants out of 5 fall into the "unrepresentative" category, you're relying on only 2 or 3 to "normalize" the data. That's a risk we'd rather not take. On the other hand, if you conduct the test with at least 8 participants, you have at least some "unrepresentativeness buffer" even if you run into 2 or 3 unrepresentative data points. That is, the impact of the outliers (which you do not know are outliers at that moment) is blunted somewhat by the larger sample size.
2. When you administer your post-task-completion rating scales, you'll have a much better chance of avoiding the extremely wide (and, thus, not so useful) confidence intervals that can plague a small sample size. That means you're going to be much more confident of the rating scale results you deliver. And, often, we find that those rating scale results complement the task completion rates, bolstering your case.
Here's an example. Let's assume that only 2 out of 8 participants are able to complete the task "find a pair of running shoes in your size" on the retail clothing site you're testing. After the test, participants are asked to rate their agreement with the statement "Finding running shoes in my size is easy" on a scale of 1 to 5, where 1 = Strongly Disagree and 5 = Strongly Agree. Let's assume that there was an even split between "1"s and "2"s (4 each), for an average of 1.5 and a resulting 95% confidence interval for the true mean rating of 1.5 ± 0.45.
Now assume that you ran the same test with only 4 participants and only 1 out of 4 participants was able to "find a pair of running shoes in your size." (This is the same 0.25 proportion of successful completions as in the example with a sample size of 8.) Again, after the test, participants are asked to rate their agreement with the statement "Finding running shoes in my size is easy" on the same 1-to-5 scale, and again, let's assume that there was an even split between "1"s and "2"s (2 each). This time, you still have an average of 1.5, but now your confidence interval has more than doubled in width to 1.5 ± 0.92!
In either case, your post-test rating scale data complement your task completion data and bolster the case that you really do have a problem with users finding shoes in their sizes. But in the above example, with a sample size of 8, your confidence interval was less than half the width of the one for a sample size of 4, meaning you have more than doubled the precision of your result. In a nutshell, you've greatly strengthened your case that users have a big problem finding running shoes in their size. (A short calculation that reproduces both intervals appears just after this list.)
3. The effort of creating and preparing a test for 4 versus 8 participants is almost the same. That is, it's the same amount of work to write up a test plan, define the tasks, get consensus on the tasks, and coordinate the assets for the test whether you're testing 4 or 8. Admittedly, it's going to take longer to recruit and actually run the tests, but it's probably a difference of only one day of testing, and you'll probably be able to report out your findings with much more statistical authority. It's analogous to making a big pot of chili for Sunday's football game; the prep time is the same whether you feed 2 or 8, and you'll invariably have some chili left over.
4. The larger sample size will also narrow the binomial confidence intervals for your actual proportion of task completions; you'll learn more about this topic in Chapter 4, and a second sketch after this list previews the calculation.
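As a quick check on the rating-scale numbers in point 2, the following Python sketch (assuming NumPy and SciPy are available; the helper name mean_ci is ours, purely for illustration) computes the t-based 95% confidence interval for a mean rating and reproduces the ±0.45 and ±0.92 half-widths from the example:

    import numpy as np
    from scipy import stats

    def mean_ci(ratings, confidence=0.95):
        # Half-width of the t-based confidence interval for the mean:
        # standard error of the mean times the critical t value.
        ratings = np.asarray(ratings, dtype=float)
        t_crit = stats.t.ppf((1 + confidence) / 2, df=len(ratings) - 1)
        return ratings.mean(), stats.sem(ratings) * t_crit

    # Eight participants, an even split of "1"s and "2"s: mean 1.5, half-width ~0.45
    print(mean_ci([1, 1, 1, 1, 2, 2, 2, 2]))

    # Four participants, same split: same mean of 1.5, but half-width ~0.92
    print(mean_ci([1, 1, 2, 2]))

The average is 1.5 in both cases; only the sample size, and therefore the width of the interval, changes.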
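And as a preview of the binomial intervals mentioned in point 4, here is a minimal sketch of the adjusted-Wald interval, the method Sauro and Lewis recommend for small-sample completion rates (again, the function name is our own):

    import math

    def adjusted_wald_ci(successes, n, z=1.96):
        # Adjusted Wald: add z^2/2 successes and z^2 trials, then compute
        # an ordinary Wald interval around the adjusted proportion.
        n_adj = n + z ** 2
        p_adj = (successes + z ** 2 / 2) / n_adj
        half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
        return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

    # 2 of 8 completions (0.25 observed): roughly 0.06 to 0.60
    print(adjusted_wald_ci(2, 8))

    # 1 of 4 completions (also 0.25 observed): wider, roughly 0.03 to 0.71
    print(adjusted_wald_ci(1, 4))

The observed completion rate is 0.25 in both cases, but the interval for 8 participants is noticeably tighter than the one for 4.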
For an excellent treatment of sample sizes, and specifically how to calculate exactly the correct number you'll need for different types of tests, we enthusiastically recommend "Quantifying the User Experience: Practical Statistics for User Research" by Jeff Sauro and James R. Lewis.
 