Happy to tell you that my paper
Controlling the Usability Evaluation Process
under Varying Defect Visibility
was accepted for presentation at HCI2009 in Cambridge, Sep 1-5
This paper is part of my work on estimating how many test sessions are required to push a usability evaluation study towards a certain goal, e.g. 80% of usability problems discovered. In last year's paper at HCI I proved that the current approaches to such estimation are flawed.
Percentage of usability problems discovered = 1-(1-p)^n
p := basic detection probability (visibility)
n := number of sessions
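Solving the formula for n gives the number of sessions needed to reach a given goal. A minimal sketch in Python; p = 0.3 is an illustrative value only, not a figure from the paper:

```python
import math

def sessions_needed(p, goal=0.80):
    """Smallest n with 1 - (1 - p)**n >= goal, assuming a single
    detection probability p for all usability problems."""
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

print(sessions_needed(0.30))   # -> 5 sessions for the 80% goal
```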
With the Nielsen/Landauer formula you risk terminating your study before you have actually reached the goal.
The problem is that the geometric series formula introduced by Nielsen and Landauer in 1992 neglects the fact that usability problems differ in their visibility. Varying defect visibility means that there is no single parameter p in the equation. Instead, p varies across usability problems: some are easier to identify than others. Varying problem visibility results in much slower progress of a study, so the geometric series formula is too optimistic.
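The optimism is easy to check numerically: with varying p, the expected proportion discovered is 1 minus the mean of (1-p_j)^n, which is always below the homogeneous prediction computed from the mean p. The visibilities below are made up for illustration:

```python
# Ten hypothetical usability problems whose visibilities average 0.3.
ps = [0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.40, 0.45, 0.50, 0.55]
mean_p = sum(ps) / len(ps)          # 0.3
n = 5

homogeneous = 1 - (1 - mean_p) ** n                           # geometric series prediction
heterogeneous = 1 - sum((1 - p) ** n for p in ps) / len(ps)   # true expectation

print(round(homogeneous, 3), round(heterogeneous, 3))   # -> 0.832 0.733
```

The homogeneous prediction overstates progress by about ten percentage points in this toy example.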
This year I will present a formula that improves accuracy by accounting for varying visibility of problems. When running a study you can use this formula to estimate the number of problems that remain undiscovered in the system. You can also compute confidence intervals for this estimate.
The new formula will allow you to make an informed decision whether to finish a study or to continue with further test participants.
Heterogeneity in the Usability Evaluation Process
accepted as a full paper at HCI2008 in Liverpool
This is the next work on the usability evaluation process. Since 1992 the Virzi/Nielsen/Landauer formula was the only choice for predicting how many users to test for a certain goal (say, 80% of the usability defects). The formula is the cumulative geometric function (CGF):
P = 1-(1-p)^n
It is also known as the curve of diminishing returns, because the closer you get to 100%, the more users you have to test to detect new defects.
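Plugging illustrative numbers into the CGF shows the diminishing returns directly; p = 0.3 is an assumed value, not a figure from the papers:

```python
import math

def users_needed(goal, p=0.3):
    """Smallest n with 1 - (1 - p)**n >= goal under the CGF."""
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

for goal in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"{goal:.0%} of defects: {users_needed(goal)} users")
# -> 50%: 2, 80%: 5, 90%: 7, 95%: 9, 99%: 13
```

Each further step towards 100% costs disproportionately more test users.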
This formula is inherently flawed. The problem is the single p: there is no single p! Evaluators differ in their skills, and some defects are much harder to detect than others. Past researchers believed that the formula still holds with the mean of p. This is not the case!
When defect heterogeneity occurs, the evaluation process proceeds more slowly than the CGF predicts. Industrial practitioners planning and controlling their usability studies take a considerable risk of stopping a study too early. Defects remain undetected, which is harmful.
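A small numerical sketch of that risk, with hypothetical defect visibilities: stop at the n where the CGF with the mean p predicts 80%, and the heterogeneous process is still well short of the goal:

```python
import math

ps = [0.05, 0.10, 0.20, 0.30, 0.50, 0.65]   # hypothetical defect visibilities
mean_p = sum(ps) / len(ps)                   # 0.3

# The CGF with the mean p predicts 80% discovery after n_stop sessions ...
n_stop = math.ceil(math.log(0.20) / math.log(1 - mean_p))
# ... but the expected fraction actually discovered by then is lower:
actual = 1 - sum((1 - p) ** n_stop for p in ps) / len(ps)

print(n_stop, round(actual, 3))   # -> 5 0.684
```

A practitioner trusting the CGF would stop after five sessions believing 80% of the defects are found, while the expected yield is only about 68%.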
If usability evaluation is your business, attend the conference, read the paper and hear the talk in order to learn:
- how to detect heterogeneity in usability studies
- how to determine heterogeneity impact
- and how to deal with it.
In the paper to be presented at CHI2008 (Florence, Italy) I propose the Rasch model for measuring the impact factors on usability evaluations. In particular, the skills of inspectors and the difficulty of defects are in question.
The Rasch model is feared for at least two reasons:
- it is an axiomatic theory, and the data have to fulfill strict properties to be Rasch scalable
- large sample sizes are required to get good estimates
At the moment I am reanalysing two data sets of usability evaluations that I found in the literature.
The picture is mixed. The standard goodness-of-fit tests give fair results, which suggests that the Rasch model holds for the data. But a subsequent reliability analysis (split-half procedure) gives good results for items only; the much more interesting reliability of the inspector skill parameters is very poor.
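For readers unfamiliar with the model: under Rasch, the probability that an inspector detects a defect depends only on the difference between the inspector's skill and the defect's difficulty, both on a logit scale. A minimal sketch with made-up parameters (not the reanalysed data):

```python
import math
import random

def rasch_p(skill, difficulty):
    """Rasch model: detection probability from inspector skill and defect difficulty."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty)))

# Made-up parameters on the logit scale: three inspectors, four defects.
skills = [-0.5, 0.0, 1.0]
difficulties = [-1.0, 0.0, 0.5, 2.0]

# Simulate one detection matrix (1 = defect found by that inspector).
random.seed(1)
matrix = [[int(random.random() < rasch_p(s, d)) for d in difficulties]
          for s in skills]
for row in matrix:
    print(row)
```

Fitting the model means recovering the skill and difficulty parameters from such 0/1 matrices, which is why large samples are needed.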
Many have guessed it, some have ignored it.
But my recent results are definite: usability defects differ in how easily they are detected, and in most cases evaluators differ in their skills at detecting them. This is also the case for participants in usability testing studies.
If you read the old papers of Virzi and of Landauer & Nielsen on predicting the usability evaluation process, you find that they
- argue that there are differences between defects and/or evaluators
- but then set that aside and choose the cumulative geometric function as a prediction model, which has only a single parameter p
Now, I have treated five data sets from the literature with some quite advanced statistical techniques. The results are clear:
- usability defects differ in their “detectability” regardless of the method or any other context factors
- in three of five studies the detection skills of evaluators differed
- only in two data sets from highly systematic experiments did the evaluator skills appear homogeneous.
Sounds like academic subtleties? Read the latest papers by Lewis: he still struggles to find a good estimator for p. A further step in my analysis revealed that heterogeneity of defects accounts for harmful overestimation in the cumulative geometric process model, which has been in use since Virzi's papers.
Lewis, J. R. (2001). Evaluation of procedures for adjusting problem-discovery rates estimated from small samples. International Journal of Human-Computer Interaction, 13, 445-479.
Nielsen, J. & Landauer, T. K. (1993). A mathematical model of the finding of usability problems. CHI '93: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM Press, 206-213.
Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, 457-468.
Glad to tell you that our full paper “Introducing Item Response Theory for Measuring the Usability Evaluation Process” has been accepted for CHI 2008 in Florence, Italy.
In this paper we treat the problem of heterogeneity in the usability evaluation process. Earlier stochastic models by Virzi, Nielsen & Landauer and Lewis employ the curve of diminishing returns P = 1-(1-p)^n for predicting the outcome of an evaluation process. The problem with this model is the single parameter p: it is unlikely that the probability of an individual evaluator detecting a usability defect is always the same.
As a solution we introduce the Rasch model for modeling the process. It allows for a distinct p_ij for each pair of evaluator and defect. With the Rasch model two basic parameters are introduced, the difficulty of defects and the skills of evaluators, and both can be measured on a truly metric scale.
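In logistic form the per-pair detection probability reads p_ij = exp(theta_i - beta_j) / (1 + exp(theta_i - beta_j)), where theta_i is the skill of evaluator i and beta_j the difficulty of defect j (my notation here, a sketch of the standard parameterization, not code from the paper):

```python
import math

def p_ij(theta_i, beta_j):
    """Rasch model: probability that evaluator i detects defect j."""
    e = math.exp(theta_i - beta_j)
    return e / (1 + e)

# Equal skill and difficulty give an even chance; a skilled evaluator
# facing an easy defect is almost certain to find it.
print(p_ij(0.0, 0.0))              # -> 0.5
print(round(p_ij(2.0, -1.0), 3))   # -> 0.953
```

Unlike the CGF, every evaluator-defect pair gets its own probability, so heterogeneity on both sides is built into the model.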
Admittedly, this is rather theoretical stuff (although I prefer the term “groundbreaking”). But we tried hard to explain the basic concepts and procedures and to describe three practical scenarios. By the time of CHI, these scenarios will be available for download as runnable programs in the statistical computing environment R.
Some photos from the HCI2007, Lancaster are online. Find them at http://www.flickr.com/photos/schmettow/tags/hci2007/
You will find photo streams from others under the tag hci2007, but be aware that photos from HCI International in Beijing also appear there.
Powered by ScribeFire.
In brief: never miss an HCI again. It was a well-organized conference with very nice and interesting people. The food was better than expected, and the English humour lived up to expectations.
Particularly stirring was the keynote by Jared Spool. He acted as an evangelist, a capitalist and a magician, and he communicated a very clear vision of what the research challenges of the near future are:
He made a very clear argument for HCI as an engineering discipline (as opposed to a craft). Usability is becoming a mission-critical quality in large-scale e-commerce applications. He suspects that finding usability defects is still far from being a development activity with reproducible outcomes. In his words, the question is no longer whether testing 5 or 8 users suffices to find 80% of the defects. Instead, companies will ask for a guarantee that 99% or 99.9% (he was not sure about the number of nines) of the defects are trapped…
… and resolved. Spool was very optimistic about the concept of usability patterns, which he considers a way to have design knowledge at hand when an application has to be (re)designed.
You can guess that these claims excited me, because I’m doing active research on improving a pattern-based inspection method and on an advanced measurement model for evaluation processes.
Van Welie is well known for his great collections of usability patterns including web patterns and a few for mobile platforms. Now he has updated his website:
An interesting fact: he dropped the previous distinction between UI platforms. This is consistent, because with the rise of Web 2.0 and increasingly powerful mobile platforms, the differences in constraints between platforms diminish.
Instead van Welie chose to arrange patterns on three primary levels:
- User needs patterns cover many aspects of the basic interaction (like navigation, data input etc.)
- Application needs patterns are a little unclear at first, but I guess they cover those problems that primarily address design forces (i.e., constraints).
- Finally, Context patterns present solutions regarding user requirements and business goals.
Personally, I don't like the unspecific notion of context. I would rather call the latter Use and Business patterns, or Requirements patterns.
But anyhow, thanks for actively maintaining this useful source of usability knowledge.