### Five Users and Damaged Merchandise

A note on the last decade's two main debates in Usability Engineering

In the early nineties, when the discipline of Usability Engineering was in its infancy, some authors shifted the focus from theoretical design considerations and methodological concerns to what can be called process quality: as the usability budget in software projects is never sufficient, the question arose of how many users (or inspectors) are enough to catch most of the usability defects in a design early. Virzi (1992), as well as Nielsen and Landauer (1993), modelled the defect detection process with a simple formula of diminishing returns:

P(+|n,p) = 1-(1-p)^n

where P(+|n,p) denotes the probability that a single defect is identified after n independent sessions, and p is the basic probability that the defect is identified in a single session. The corresponding curve approaches 1 asymptotically. Other researchers have extended this formula to incorporate, for example, the ability of the inspectors. But the basic idea remains the same:
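The diminishing returns are easy to see when the formula is evaluated for a growing number of sessions. The following is a minimal sketch in Python; the function name `detection_probability` and the value p = 0.25 are illustrative assumptions of mine, not taken from the cited studies.

```python
def detection_probability(p: float, n: int) -> float:
    """Probability that a single defect is found at least once
    in n independent sessions, given per-session probability p.

    Implements P(+|n,p) = 1 - (1 - p)^n.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.25  # illustrative value only, not an empirical estimate
    previous = 0.0
    for n in range(1, 11):
        current = detection_probability(p, n)
        # The gain per additional session shrinks as n grows.
        print(f"n={n:2d}  P={current:.3f}  gain={current - previous:.3f}")
        previous = current
```

Each additional session adds less than the one before it, which is why the curve flattens as it approaches 1.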

Measuring the effectiveness characteristics of a method is a prerequisite for streamlining the quality assurance process.

The “Five users is (NOT) enough” debate did not arise from the mathematical model in the first place, but simply from the fact that both studies determined p to be around 0.35. This yields a detection rate of more than 80% with five users or inspectors, but can, of course, vary with the circumstances. Current studies are already investigating the sources of variance. Thus we can expect better predictions of p in the future, provided that the underlying factors can be measured efficiently.
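How strongly the "five users" figure depends on p can be checked by inverting the formula: for a given p, how many sessions are needed to reach a target detection rate? The sketch below is my own illustration; the function name `sessions_needed`, the target rates, and the alternative p values are assumptions chosen for demonstration, and only p = 0.35 comes from the text above.

```python
def sessions_needed(p: float, target: float) -> int:
    """Smallest n such that 1 - (1 - p)^n >= target.

    p      : per-session detection probability, 0 < p <= 1
    target : desired detection rate, 0 <= target < 1
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must lie in [0, 1)")
    n = 0
    # Add sessions one at a time until the target rate is reached.
    while 1.0 - (1.0 - p) ** n < target:
        n += 1
    return n


if __name__ == "__main__":
    # 0.35 is the value mentioned in the text; the others are hypothetical.
    for p in (0.15, 0.25, 0.35):
        for target in (0.80, 0.90):
            n = sessions_needed(p, target)
            print(f"p={p:.2f}  target={target:.0%}  sessions needed: {n}")
```

Under this model, p = 0.35 crosses the 80% mark after four sessions and gives roughly 88% after five, whereas a smaller p pushes the required number of sessions up quickly. This is the sense in which the debate is about the estimate of p rather than about the model.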

In the late nineties the “Damaged Merchandise” debate arose: Gray & Salzman (1998) strongly criticized the far-reaching conclusions that researchers drew from poorly designed comparative studies of usability evaluation methods (UEM). A comparative UEM study simply tries to tell practitioners whether they should prefer method A over method B for usability evaluation because it detects defects more effectively. Current research on UEM comparison focuses mostly on sound study design and on finding valid criteria for effectiveness.

What I want to point out here is that both areas of research are strongly interrelated:

Whereas the primary goal of comparative studies is to rank two methods, process predictability requires an accurate estimate of p. But estimating p for each method automatically yields a ranking of the two methods. Conversely, the estimate of p can only produce valid comparisons and process predictions if the study is well designed for generalizability and validity.

