Archive for the ‘English’ Category

Usability Inspection Methods: Keen competition on a low level?

Heard about the Usability Pattern Inspection (UPI), a promising new usability evaluation method? This method not only supports practitioners in detecting usability defects, but also helps them propose better design alternatives.
But before we can insist on this competitive advantage, we are obliged to prove that the UPI's defect detection capability is comparable to that of established inspection methods, especially the top dog, Heuristic Evaluation (HE).
To get a first idea of the UPI's performance, it was compared to the HE in a small-sample inspection experiment. The defects identified by the participants were afterwards verified via falsification testing.
The results are promising!
The performance criteria thoroughness, validity and effectiveness were compared, and the participants using the UPI performed practically equal to the HE group in all disciplines. The UPI is thus capable of capturing as many true defects as the HE and doesn’t produce more false hits. In other words, there is at the moment no argument for practitioners to prefer the HE and forgo the design recommendations the UPI was designed for.
But the results are also disappointing!
Inspection methods are usually employed because they are easy to learn and quick to apply. But in our experiment, we got very low values for the identification of usability defects:

  • six HE inspectors together found 28 true defects
  • four UPI inspectors together found 22 true defects
  • four sessions of falsification usability testing yielded 86 true defects.

This computes to a probability of .15 for a single inspector to detect a certain defect. According to the curve of diminishing returns (Virzi's formula), a group of 10 inspectors would be needed to capture 80% of all defects with this low base probability. Of course, this was under strict laboratory conditions with quite a short time for learning and inspecting. But it is still quite low compared to the subsequent usability test, which uncovered 86 defects in just four sessions.
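The group size quoted above can be checked with a few lines of Python. This is only a sketch of the diminishing-returns formula 1-(1-p)^n; the function names `detection_rate` and `sessions_needed` are made up for this illustration and are not part of any published tool.

```python
def detection_rate(p: float, n: int) -> float:
    """Probability that a given defect is found at least once
    in n independent sessions with base probability p."""
    return 1.0 - (1.0 - p) ** n


def sessions_needed(p: float, target: float) -> int:
    """Smallest number of sessions whose detection rate reaches the target."""
    if not (0.0 < p <= 1.0) or not (0.0 < target < 1.0):
        raise ValueError("p must be in (0, 1] and target in (0, 1)")
    n = 1
    while detection_rate(p, n) < target:
        n += 1
    return n


# Base probability from the experiment, 80% target
print(sessions_needed(0.15, 0.80))  # -> 10
```

Nine inspectors would still fall slightly short of the 80% mark (about 77%), which is why the estimate lands at ten.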


Ubuntu Mobile and Embedded Edition

An announcement on the Ubuntu developer mailing list said that there will be a special distribution for mobile devices. After seven years of struggling to get Linux to run on notebooks, I can only say: COOL! I hope they will actually take notebooks into account, which today are at the heavyweight end of mobile devices. But I would also like to see Ubuntu Linux running with a full desktop on embedded devices like all those media boxes, NAS devices or routers.

Powered by ScribeFire.

Preparation of Falsification Usability Testing

In recent research on evaluating usability inspection methods, falsification usability tests (FUT) are employed to identify false alarms. False alarms are those defects predicted by at least one inspector which are not true defects, meaning that they do not cause actual usage problems.
Thus, a FUT aims at challenging the defects predicted in an inspection experiment and classifying them as either true or false. This is in contrast to exploratory usability tests, which are employed to identify as many true defects as possible.
As a consequence, a FUT can be constructed much narrower in scope and with a much stricter observation scheme. This can be accomplished with the following procedure:

  1. Constructing the observation scheme
    1. Collect all defect identifications from the inspection experiment.
    2. Normalize the defects across inspectors resulting in a set of unique defect descriptions.
    3. Gather all comments from the inspectors and group them with the normalized defects.
    4. Review the defects and compile a set of observations for each defect: Use the inspector comments, the inspection task in which the identification occurred, and the defect description to predict observable usage problems the defect is likely to cause.
  2. Construction of testing material
    1. Gather an initial set of testing tasks which cover all parts of the tested application where defects were identified.
    2. Assign defects to the tasks in which they are likely to occur. If the set of defects is large, this can be done with an intermediate step: First assign defects to dialogue elements and dialogue elements to tasks. This will result in a rough assignment of defects to tasks. Then review the initial set of defects per task and eliminate those which are not likely to occur.
    3. Review the set of defects assigned to tasks:
      • assure that each defect is covered by at least one task
      • if a defect is covered by very few tasks, make sure that those tasks will most likely challenge the defect
      • if a defect is covered by many tasks, eliminate the defect from those tasks, where it is less likely to occur
    4. Prepare an observation record with the following structure:
      • Task
        • Dialogue Element
          • Defect
            • Observations

      (Have a look at the example at the end of the page)

  3. Run the test
    1. Prepare the usability test as usual.
    2. Remind the user to:
      • think aloud
      • strictly follow the order of testing tasks
    3. Use the observation record to gather the observations for each user.
  4. Analyse the data: 
    1. Keep in mind that with a FUT it suffices to have only one observation for each defect. Thus, after each session the observation protocol can be reduced by the defects already observed. Even the task set can be reduced if there are no open defects left for a task.
    2. Classify the defects:
      • defects observed in the FUT become true defects
      • defects not observed become false defects
    3. Review your response matrix from the inspection protocol. You should now be able to classify each identification event as hit, miss or false alarm.

The classified response matrix can now serve to compute quality criteria for the evaluated inspection method. In particular, the validity can be estimated. The validity is a measure of how free the method is of false alarms. But keep in mind that the FUT does not provide what is needed to estimate the thoroughness of the method, because the FUT does not explore for previously unknown defects.
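The analysis step can be sketched in Python. Everything in this example is hypothetical: the inspector names, the defect IDs and the FUT outcome are invented, and validity is computed as hits / (hits + false alarms) over all identification events, which is one common way to express "freedom from false alarms" and is an assumption here, not a definition taken from the procedure above.

```python
# Hypothetical inspection data: normalized defects predicted by each inspector.
predictions = {
    "inspector_1": {"D1", "D2", "D4"},
    "inspector_2": {"D2", "D3"},
    "inspector_3": {"D1", "D3", "D4"},
}

# Hypothetical FUT outcome: defects with at least one observed usage problem.
observed_in_fut = {"D1", "D3"}


def classify_defects(predictions, observed):
    """Observed defects become true defects, all other predicted ones false."""
    predicted = set().union(*predictions.values())
    return {d: ("true" if d in observed else "false") for d in sorted(predicted)}


def classify_events(predictions, defect_status):
    """Label the cells of the inspector x defect response matrix."""
    events = {}
    for inspector, found in predictions.items():
        for defect, status in defect_status.items():
            if defect in found:
                events[(inspector, defect)] = "hit" if status == "true" else "false alarm"
            elif status == "true":
                events[(inspector, defect)] = "miss"
            # a false defect that was not predicted needs no label here
    return events


def validity(events):
    """Share of identification events that point at true defects (assumed formula)."""
    hits = sum(1 for label in events.values() if label == "hit")
    false_alarms = sum(1 for label in events.values() if label == "false alarm")
    if hits + false_alarms == 0:
        raise ValueError("no identification events to evaluate")
    return hits / (hits + false_alarms)


defect_status = classify_defects(predictions, observed_in_fut)
events = classify_events(predictions, defect_status)
print(defect_status)
print(validity(events))  # -> 0.5
```

In this toy data set four identification events are hits and four are false alarms, so half of what the inspectors reported would have been noise.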

Front Focussing of Sigma 24mm/1.8, Nikon 85mm/1.8 and Tamron 17-50mm/2.8

Surprise, Surprise! Got my Sigma 24mm back (remember the quick comparison with a Tamron 17-50 a few weeks ago). A professional photographer bought it and returned it a few weeks later because of focussing issues. He suspected that it might suffer from front focussing. This is a problem with some third-party lenses, where the autofocus focusses on a point in front of the object and not exactly on the object plane. A short explanation with a quick test, as well as a more thorough test procedure, can be found online. So I tested the Sigma and the two other lenses I own with the quick procedure on my Nikon D80. This procedure also reveals tendencies towards chromatic aberrations in the defocus areas.

The Nikon 85mm/1.8 focusses perfectly fine, but shows strong vertical chromatic aberrations in the defocus areas: green behind the focussing plane and magenta in front. The Tamron 17-50mm/2.8 focusses well, with, maybe, a slight tendency towards back focussing. There is nearly no chromatic aberration. The Sigma shows a clear front focussing effect and mild chromatic aberrations.

Have a look at the test images:

Nikon Nikkor 85mm/1.8

Sigma 24mm/1.8 EX DG Macro

Tamron 17-50mm/2.8

Shock: My Nikon got its feet wet

On Saturday we went on a first hiking excursion in the Bavarian Forest (the Lusen, ~1360m). I made some nice shots of the dying forest on the south-east mountain side. When we had lunch at the Lusen hut, I noticed with an incredible shock that my water bag had dripped and water had already entered my camera (Nikon D80). Fog was visible on the display and inside the viewfinder. I immediately removed the battery. Back at home I stored the camera and the lens in a sunny, warm place. Both were open but covered with a black sleeve. Today, after carefully cleaning the inside of the body and the lens with a brush and a Rocket Air, everything seems to be fine again.
Once bitten, I’m now looking for a holster bag that is really waterproof. I think I’m going for an Ortlieb Aqua Zoom. Ortlieb gear usually is tough, absolutely waterproof and reasonably light. They also provide a four-point strap to fix the camera on your belly. Must be great for outdoor activities…

Comparison of Sigma EX DG 24mm/1.8 versus Tamron 17-50mm/2.8 XR DiII

Today I got my new lens, a Tamron 17-50mm/2.8. This will probably replace (and augment) my Sigma 24mm/1.8. Though I liked the handy focal length and the speed of the Sigma, I often suspected it of producing quite unsharp and dull pictures. A quick comparison with a few shots at aperture 2.8 exposes the truth. The Tamron is the better lens in every way (except that it’s slightly slower, of course). It is much sharper and more brilliant, and it produces less chromatic aberration and flare even wide open. Have a look!

Sigma EX DG 24mm/1.8 at F2.8 | Tamron SP AF 17-50mm/2.8 XR DiII at F2.8
DSC_3687.JPG | DSC_3688.JPG
DSC_3688_crop.JPG | DSC_3687_crop.JPG
DSC_3689.JPG | DSC_3691.JPG

Measurement and Noise

Types of noise

Consider noise in digital photography (and find a well-illustrated introduction here). I will summarize:
Noise is a deviation or error from the true picture which is totally unrelated to the picture. Unrelated means that one cannot predict the appearance of a noise pixel even if the true picture is known.

Basically, there are two types of noise on digital sensors:

  1. Noise that stems from irregularities of a particular sensor. This noise is still unrelated to the picture, but it is predictable from one exposure to another.
  2. Totally random noise. Not predictable by anything.

Removing noise

It is easy to remove the totally random noise. Just take two or more exposures of the same true picture and average them. Since the noise is correlated with neither the picture nor the camera, it will diminish.
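A small simulation illustrates the averaging idea. It is a sketch under simple assumptions: the "picture" is just a list of 500 made-up pixel values, the random noise is Gaussian with an assumed standard deviation of 10, and the helper names (`expose`, `average`, `rms_error`) are invented for this example.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical true picture: 500 pixel values between 0 and 255
TRUE_PICTURE = [rng.uniform(0, 255) for _ in range(500)]
NOISE_SIGMA = 10.0  # assumed strength of the totally random noise


def expose(picture):
    """One exposure: the true picture plus fresh random noise on every pixel."""
    return [value + rng.gauss(0.0, NOISE_SIGMA) for value in picture]


def average(exposures):
    """Pixel-wise mean over several exposures."""
    return [sum(pixels) / len(pixels) for pixels in zip(*exposures)]


def rms_error(picture, estimate):
    """Root mean square deviation of an estimate from the true picture."""
    squared = [(a - b) ** 2 for a, b in zip(picture, estimate)]
    return (sum(squared) / len(squared)) ** 0.5


single = expose(TRUE_PICTURE)
stacked = average([expose(TRUE_PICTURE) for _ in range(25)])

print("error of a single exposure:", round(rms_error(TRUE_PICTURE, single), 2))
print("error of 25 averaged exposures:", round(rms_error(TRUE_PICTURE, stacked), 2))
```

With independent noise the error of the average shrinks roughly with the square root of the number of exposures, so 25 exposures should leave about a fifth of the original error.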

It is a little trickier to remove the sensor-related noise. If you took several exposures of the same true picture with the same camera, the noise would add up as well and not be removed.
There are two things you can do: subtraction of the sensor's noise profile, or multiple exposures with different cameras.

If you choose the subtractive approach, you first have to assess the sensor's noise profile. How to do that? Simply take multiple exposures of different random pictures (e.g. sections of a white wall). Average the images. What remains will be the noise profile of your sensor. You can now subtract it from every future picture you make.

The alternative is to take the picture with different cameras. As the noise is only related to a specific piece of sensor, it will diminish with averaging.
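The subtractive approach can be simulated in the same spirit. Again, this is only a sketch: the sensor profile is an invented constant offset per pixel, the "white wall" frames are flat pictures of varying brightness, and removing the overall mean level from the averaged wall frames is an assumption made here to separate the wall's brightness from the profile.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable
N_PIXELS = 400

# Hypothetical sensor-specific noise: one constant offset per pixel
SENSOR_PROFILE = [rng.gauss(0.0, 8.0) for _ in range(N_PIXELS)]
RANDOM_SIGMA = 3.0  # assumed strength of the totally random noise


def expose(picture):
    """One exposure with this sensor: picture + sensor offset + random noise."""
    return [value + offset + rng.gauss(0.0, RANDOM_SIGMA)
            for value, offset in zip(picture, SENSOR_PROFILE)]


def average(frames):
    """Pixel-wise mean over several frames."""
    return [sum(pixels) / len(pixels) for pixels in zip(*frames)]


def rms_error(picture, estimate):
    squared = [(a - b) ** 2 for a, b in zip(picture, estimate)]
    return (sum(squared) / len(squared)) ** 0.5


# Step 1: assess the noise profile from many flat "white wall" frames
wall_frames = [expose([rng.uniform(100, 200)] * N_PIXELS) for _ in range(200)]
mean_frame = average(wall_frames)
mean_level = sum(mean_frame) / len(mean_frame)
estimated_profile = [value - mean_level for value in mean_frame]

# Step 2: averaging exposures of one scene removes only the random noise ...
scene = [rng.uniform(0, 255) for _ in range(N_PIXELS)]
stacked = average([expose(scene) for _ in range(50)])
# ... subtracting the estimated profile removes the sensor noise as well
corrected = [value - offset for value, offset in zip(stacked, estimated_profile)]

print("averaged only:", round(rms_error(scene, stacked), 2))
print("averaged and profile subtracted:", round(rms_error(scene, corrected), 2))
```

The averaged-only picture keeps almost the full sensor noise, while the corrected one should be left with a small residual error.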


“Why is this piece of blog in the category Usability Research?”

“Because there are phenomena regarding noise in usability evaluations. And we can learn from drawing the analogy.”

Up to this we can distinguish the following terms:

  • The measured object is what we want to represent as exactly as possible
  • We use a certain instrument, …
  • … which produces instrument specific noise
  • Additionally we have to fight totally random noise.
  • We have the ability to make multiple independent measures with the same instrument …
  • … and multiple measures with different instruments

Multiple measures with the same instrument will remove totally random noise when the same object is measured, or will identify the instrument-specific noise when different objects are measured.
Or we can measure the same object with different instruments to remove instrument-related deviations.

How does this apply to usability evaluation methods? Next time!

Defects, problems and uncertainty in falsification testing

Usability Inspection Methods are widely used to predict usage problems with a product. The outcome is usually a list of identified usability defects in the design of the product. Accordingly, as Woolrych et al. (2004) point out, a defect identification is only valid if evidence for a usage problem can be found. They suggest Falsification User Testing (FUT) to achieve this. FUT is done subsequently to the assessment study, and the testing tasks are tightly focussed on the identified defects. Thus, a test person is very likely to stumble on a defect, assuming it exists.

Of course, there exists a certain amount of uncertainty. Defect identification and usage problem identification are stochastic processes (remember the curve of diminishing returns) and thus, a true problem might not be observed with your finite set of test persons.

What I want to remark here are two things:

  1. Terminology:
    • A user usually has a usage goal, and thus (s)he can also have a usage problem.
    • On the other hand: products have attributes, like usability, and they can have defects which lower the attribute, thus usability defect.
    • A UI designer has the goal of designing for usability. Hence, he can have either a design problem (more general, as it also takes other -bilities into account) or a usability problem.
  2. The aforementioned uncertainty is not too relevant. The severity of a usability defect should be a function of the frequency and the impact of resulting usage problems (Rubin, 1994). Hence, if a usage problem is not observed in a reasonable sample, it won’t have a high frequency and likely not a strong impact. This uncertainty should therefore predominantly appear with minor defects.

Woolrych, A.; Cockton, G. & Hindmarch, M. Falsification Testing for Usability Inspection Method Assessment Proceedings of the HCI 2004, 2004

Rubin, J. Handbook of Usability Testing. How to Plan, Design, and Conduct Effective Tests John Wiley & Sons, 1994

HDR Panoramas with free tools (Part 1, Overview)

After my first naive trials of creating high dynamic range panoramas I dug a little deeper. I started figuring out a workflow which

  • is completely HDR, no intermediate JPEGs or PNGs
  • uses only free open source tools

At the moment my workflow is designed as follows:

  1. Exposures in RAW with my Nikon D80, triple bracketing for each panorama section
  2. Combination of each bracketing set with Qtpfsgui
  3. Panorama Stitching with Hugin (v0.7b4)
  4. Retouching and Cropping with Cinepaint
  5. Tonemapping and Jpeg Export with Qtpfsgui

Each of the above tools supports one or more HDR formats (hdr, pfs, openexr, tiff), but the intersection is incomplete. The following table therefore shows which formats to load and save in each step:

HDR Panorama Workflow with Input and Output Formats

Nr | Step | Tool | Open as | Save as | Comment
---|------|------|---------|---------|--------
1 | Exposure | Digital camera | | RAW |
2 | HDR Combination | Qtpfsgui | RAW | HDR |
3 | Panorama Stitching | Hugin | HDR | TIFF (IEEE 32bit) | Automatic control points with autopano-sift and the PTStitcher don’t work with HDR. Define control points manually and stitch with nona
 | Intermediate Converting | PFStools | TIFF | OpenEXR | TIFFs saved from Hugin won’t show up in Cinepaint
4 | Cropping and Retouching | Cinepaint | OpenEXR | OpenEXR | Adding an alpha channel required before saving
5 | Tonemapping | Qtpfsgui | OpenEXR | JPEG | My favorite algo is fattal02


Five Users and Damaged Merchandise

A note on the last decade's two main debates in Usability Engineering

In the early nineties, when the discipline of Usability Engineering was in its infancy, some authors shifted the focus from theoretical design considerations and methodological concerns to what can be called process quality: As the usability budget in software projects is never satisfactory, the question arose how many users (or inspectors, respectively) are enough to catch most of the usability defects in a design early. Virzi (1992), as well as Nielsen and Landauer (1993), modelled the defect detection process with the simple formula of diminishing returns, which is:

P(+|n,p) = 1-(1-p)^n

where P(+|n,p) denotes the probability of a single defect being identified after n independent sessions; p is the basic probability of a defect being identified in one session. The graph corresponding to this formula asymptotically approaches 1. The formula has been enhanced by other researchers to incorporate, for example, the ability of the inspectors. But the basic idea remains the same:

Measuring the effectiveness characteristics of a method is a prerequisite for streamlining the quality assurance process.

The “Five users is (NOT) enough” debate didn’t arise from the mathematical model in the first place, but simply from the fact that both studies determined p to be around 0.35. This yielded a detection rate of 80% or more with five users/inspectors, but can, of course, vary with the circumstances. Current studies are already investigating the sources of variance. Thus we can expect better predictions of p in the future, provided that the underlying factors can be measured efficiently.
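How strongly the "five users" result depends on p can be seen by tabulating the formula for a few values. This is a small sketch; the function name `detection_curve` is invented for this illustration, and the second value of p (0.15) is just an example of a much lower base probability.

```python
def detection_curve(p: float, max_n: int) -> list:
    """P(+|n,p) = 1-(1-p)^n for n = 1 .. max_n sessions."""
    return [1.0 - (1.0 - p) ** n for n in range(1, max_n + 1)]


for p in (0.35, 0.15):
    curve = detection_curve(p, 5)
    # Print the cumulative detection rate after each of the five sessions
    print(p, [round(rate, 3) for rate in curve])
```

With p = 0.35 the fifth session brings the detection rate to roughly 88%, comfortably above the 80% mark, whereas with p = 0.15 five sessions reach only about 56%.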

In the late nineties the “Damaged Merchandise” debate arose: Gray & Salzman (1998) strongly criticized the far-reaching conclusions that researchers drew from poorly designed comparative studies of usability evaluation methods (UEM). A comparative UEM study simply tries to inform practitioners whether they should prefer method A over method B for usability evaluation because it detects defects more effectively. The current research on UEM comparison focusses mostly on aspects of sound study design and on finding valid criteria for effectiveness.

What I want to remark here is that both areas of research are strongly interrelated:
Whereas the primary goal in comparative studies is to rank two methods, process predictability requires an exact estimate of p. But the p estimate automatically yields a ranking of two methods. Vice versa, the p estimate can only produce valid comparisons and process predictions when the study is well designed for generalizability and validity.

Nielsen, J. & Landauer, T. K. A mathematical model of the finding of usability problems CHI ’93: Proceedings of the SIGCHI conference on Human factors in computing systems, ACM Press, 1993, 206-213

Virzi, R. A. Refining the Test Phase of Usability Evaluation: How many Subjects is enough? Human Factors, 1992, 34, 457-468

Gray, W. D. & Salzman, M. C. Damaged Merchandise? A Review of Experiments that Compare Usability Evaluation Methods Human-Computer Interaction, 1998, 13, 203-261
