Pattern Inspection Paper accepted at British HCI

I’m glad to announce that our paper A Pattern-based Usability Inspection Method: First Empirical Performance Measures and Future Issues (together with Sabine Niebuhr) has been accepted at the British HCI conference in Lancaster. Although the sample in the experiment was quite small, most reviewers were pleased to see empirical results and a truly novel inspection method appearing.

Usability Inspection Methods: Keen competition on a low level?

Heard about the Usability Pattern Inspection (UPI), a promising new usability evaluation method? This method supports practitioners not only in detecting usability defects, but also helps them to propose better design alternatives.
But before we can stress this competitive advantage, we are obliged to prove that the UPI’s defect detection capability is comparable to that of established inspection methods, especially the top dog, Heuristic Evaluation (HE).
To get a first idea of the UPI’s performance, it was compared to the HE in a small-sample inspection experiment. The defects identified by the participants were afterwards verified via falsification testing.
The results are promising!
The performance criteria thoroughness, validity and effectiveness were compared, and the participants using the UPI performed practically on par with the HE group in all disciplines. The UPI is thus capable of capturing as many true defects as the HE and doesn’t produce more false hits. In other words, there is at the moment no argument for practitioners to prefer the HE and not to profit from the design recommendations the UPI was designed for.
But the results are also disappointing!
Inspection methods are usually employed because they are easy to learn and quick to apply. But in our experiment, we got very low values for the identification of usability defects:

  • six HE inspectors together found 28 true defects
  • four UPI inspectors together found 22 true defects
  • four sessions of falsification usability testing yielded 86 true defects.

This computes to a probability of .15 for a single inspector to detect a certain defect. According to the curve of diminishing returns (Virzi’s formula), a group of 10 inspectors would be needed to capture 80% of all defects with this low base probability. Of course, this was under strict laboratory conditions with quite a short time for learning and inspecting. But it is still quite low compared to the subsequent usability test, which uncovered 86 defects with just four sessions.
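
For readers who want to check the arithmetic: the curve of diminishing returns is usually written as P(n) = 1 - (1 - p)^n, with p being the single-inspector detection probability. Here is a minimal sketch in Python using the numbers reported above (the function names are mine, purely for illustration):

import math

def detection_rate(p, n):
    # Expected proportion of defects found by n independent inspectors:
    # the curve of diminishing returns, 1 - (1 - p)^n.
    return 1 - (1 - p) ** n

def inspectors_needed(p, target):
    # Smallest n such that the expected detection rate reaches the target.
    return math.ceil(math.log(1 - target) / math.log(1 - p))

p = 0.15                            # single-inspector detection probability from the experiment
print(detection_rate(p, 10))        # ~0.80
print(inspectors_needed(p, 0.80))   # 10 inspectors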

Ubuntu Mobile and Embedded Edition

An announcement on the Ubuntu developer mailing list said that there will be a special distribution for mobile devices. After seven years of struggling to get Linux running on notebooks I can only say: COOL! I hope they will actually take notebooks into account, which today sit at the heavyweight end of mobile devices. But I would also like to see Ubuntu Linux running with a full desktop on embedded devices like all those media boxes, NAS devices or routers.

Preparation of Falsification Usability Testing

In recent research on evaluating usability inspection methods, falsification usability tests (FUT) are employed in order to identify false alarms. False alarms are those defects predicted by at least one inspector which are not true defects, meaning that they do not cause actual usage problems.
Thus, a FUT aims at challenging the defects predicted in an inspection experiment and classifying them as either true or false. This is opposed to exploratory usability tests, which are employed to identify as many true defects as possible.
As a consequence, a FUT can be constructed much narrower in scope and with a much stricter observation scheme. This can be accomplished with the following procedure:

  1. Constructing the observation scheme
    1. Collect all defect identifications from the inspection experiment.
    2. Normalize the defects across inspectors resulting in a set of unique defect descriptions.
    3. Gather all comments from the inspectors and group them with the normalized defects.
    4. Review the defects and compile a set of observations for each defect: use the inspector comments, the inspection task where the identification occurred and the defect description to predict observable usage problems the defect is likely to cause.
  2. Construction of testing material
    1. Gather an initial set of testing tasks which cover all parts of the tested application where defects were identified.
    2. Assign to each task the defects that are likely to occur with it. If the set of defects is large, this can be done with an intermediate step: first assign defects to dialogue elements and dialogue elements to tasks. This will result in a rough assignment of defects to tasks. Then review the initial set of defects per task and eliminate those which are not likely to occur.
    3. Review the set of defects assigned to tasks:
      • ensure that each defect is covered by at least one task
      • if a defect is covered by very few tasks, make sure that those tasks will most likely challenge the defect
      • if a defect is covered by many tasks, eliminate the defect from those tasks, where it is less likely to occur
    4. Prepare an observation record with the following structure:
      • Task
        • Dialogue Element
          • Defect
            • Observations

      (Have a look at the example at the end of the page; a data-structure sketch of this record also follows after this procedure.)

  3. Run the test
    1. Prepare the usability test as usual.
    2. Remind the user to:
      • think aloud
      • strictly follow the order of testing tasks
    3. Use the observation record to gather the observations for each user.
  4. Analyse the data: 
    1. Keep in mind that with a FUT it suffices to have only one observation for each defect. Thus, the observation protocol can be pruned of already observed defects after each session. Even the task set can be reduced if no open defects remain for a task.
    2. Classify the defects:
      • defects observed in the FUT become true defects
      • defects not observed become false defects
    3. Review your response matrix from the inspection protocol. You should now be able to classify each identification event as hit, miss or false alarm.
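
As referenced in step 2.4 above, here is a minimal data-structure sketch of the observation record, including the pruning rule from the analysis step. It is my own illustration in Python; the class and field names are assumptions, not part of the method itself.

from dataclasses import dataclass, field

@dataclass
class Defect:
    description: str            # normalized defect description
    observations: list[str]     # predicted observable usage problems
    observed: bool = False      # set to True once seen in any FUT session

@dataclass
class DialogueElement:
    name: str
    defects: list[Defect] = field(default_factory=list)

@dataclass
class Task:
    instruction: str
    elements: list[DialogueElement] = field(default_factory=list)

    def open_defects(self):
        # Defects not yet observed; if this is empty, the task can be
        # dropped from later sessions (see step 4.1).
        return [d for e in self.elements for d in e.defects if not d.observed]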

The classified response matrix can now serve to compute quality criteria for the evaluated inspection method. In particular, the validity can be estimated; validity is a measure of how free the method is of false alarms. But keep in mind that the FUT does not provide the basis to estimate the thoroughness of the method, since the FUT does not explore for previously unknown defects.
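
As a small companion sketch (in Python), this is how the usual criteria could be computed from the classified matrix. The definitions follow common usage in the inspection-method literature; verify them against the framework you actually use.

def validity(hits, false_alarms):
    # Proportion of predicted defects that turned out to be true defects,
    # i.e. how free the method is of false alarms.
    return hits / (hits + false_alarms)

def thoroughness(hits, total_true_defects):
    # Proportion of all known true defects that the method found. The
    # denominator must come from an exploratory source (e.g. a usability
    # test), not from the FUT alone.
    return hits / total_true_defects

def effectiveness(hits, false_alarms, total_true_defects):
    # Commonly defined as thoroughness multiplied by validity.
    return thoroughness(hits, total_true_defects) * validity(hits, false_alarms)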

Front Focussing of Sigma 24mm/1.8, Nikon 85mm/1.8 and Tamron 17-50mm/2.8

Surprise, surprise! Got my Sigma 24mm back (remember the quick comparison with a Tamron 17-50 a few weeks ago). A professional photographer bought it and returned it a few weeks later because of focussing issues. He suspected that it might suffer from front focussing. This is a problem with some third-party lenses, where the autofocus focusses on a point in front of the object and not exactly on the object plane. A short explanation and a quick test can be found at Photo.net. A more thorough test procedure is here. So I tested the Sigma and the two other lenses I own with the quick procedure on my Nikon D80. This procedure also reveals tendencies of chromatic aberrations in the defocus areas.

Results
The Nikon 85mm/1.8 focusses perfectly fine, but shows strong vertical chromatic aberrations in the defocus areas: green behind the focussing plane and magenta in front. The Tamron 17-50mm/2.8 focusses well, with – maybe – a slight tendency towards back focussing. There is nearly no chromatic aberration. The Sigma shows a clear front focussing effect and mild chromatic aberrations.

Have a look at the test images:

Nikon Nikkor 85mm/1.8
_DSC4423.JPG

Sigma 24mm/1.8 EX DG Macro
_DSC4421.JPG

Tamron 17-50mm/2.8
_DSC4417.JPG

Shock: My Nikon got its feet wet

On Saturday we went on a first hiking excursion in the Bavarian Forest (the Lusen, ~1360m). I made some nice shots of the dying forest on the south-east mountainside. When we lunched at the Lusen hut I noticed with an incredible shock that my water bag had dripped and water had already entered my camera (Nikon D80). Fogging was visible on the display and inside the viewfinder. I immediately removed the battery. Back at home I stored the camera and the lens in a sunny, warm place. Both were open but covered with a black sleeve. Today, after carefully cleaning the inside of the body and the lens with a brush and a Rocket Air, everything seems to be fine again.
Burned by this experience, I’m looking for a holster bag that is really waterproof. I think I’m going for an Ortlieb Aqua Zoom. Ortlieb gear usually is tough, absolutely waterproof and reasonably light. They also provide a four-point strap to fix the camera on your belly. Must be great for outdoor activities…

Comparison of Sigma EX DG 24mm/1.8 versus Tamron 17-50mm/2.8 XR DiII

Today I got my new lens, a Tamron 17-50mm/2.8. This will probably replace (and augment) my Sigma 24mm/1.8. Though I liked the handy focal length and speed of the Sigma, I often suspected it of producing quite unsharp and dull pictures. A quick comparison with a few shots at aperture 2.8 exposes the truth. The Tamron is the better lens in every way (except that it’s slightly slower, of course). It is much sharper, more brilliant and produces fewer chromatic aberrations and flares, even wide open. Have a look!

Sigma EX DG 24mm/1.8 at F2.8 Tamron SP AF 17-50mm/2.8 XR DiII at F2.8
DSC_3687.JPG DSC_3688.JPG
DSC_3688_crop.JPG DSC_3687_crop.JPG
DSC_3689.JPG DSC_3691.JPG

Measurement and Noise

Types of noise

Consider noise in digital photography (and find a well-illustrated introduction here). I will summarize:
Noise is a deviation or error from the true picture which is totally unrelated to the picture. Unrelated means that one cannot predict the appearance of a noise pixel even if the true picture is known.

Basically, there are two types of noise on digital sensors:

  1. Noise that stems from irregularities of a particular piece of sensor. This noise is still unrelated to the picture, but it is predictable from one exposure to the next.
  2. Totally random noise. Not predictable by anything.

Removing noise

It is easy to remove the totally random noise. Just take two or more exposures of the same true picture and average them. Since the noise is correlated with neither the picture nor the camera, it will diminish.
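
A minimal sketch of the averaging idea, assuming the exposures are already aligned and available as arrays (numpy is my choice here, not something the text prescribes):

import numpy as np

def average_exposures(frames):
    # Average several aligned exposures of the same scene. Uncorrelated
    # random noise shrinks roughly with 1/sqrt(number of frames).
    stack = np.stack([f.astype(np.float64) for f in frames])
    return stack.mean(axis=0)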

It is a little trickier to remove the sensor-related noise. If you took several exposures of the same true picture with the same camera, this noise would persist in the average and not be removed.
There are two things you can do: subtraction of the sensor’s noise profile or multiple exposures with different cameras.

If you choose the subtractive approach, you first have to assess the sensor’s noise profile. How to do that? Simply take multiple exposures of different random pictures (e.g. sections of a white wall). Average the images. What remains will be the noise profile of your sensor. You can now subtract it from every future picture you take.
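
A sketch of the subtractive approach as described above, again assuming aligned numpy arrays; the exact calibration procedure (what to shoot, how many frames) is simplified for illustration:

import numpy as np

def estimate_noise_profile(calibration_frames):
    # Average many exposures of "random" content (e.g. sections of a white
    # wall). Scene content and random noise average out; what remains is the
    # sensor's fixed, per-pixel deviation from a uniform response.
    profile = np.stack([f.astype(np.float64) for f in calibration_frames]).mean(axis=0)
    return profile - profile.mean()

def remove_fixed_pattern(image, profile):
    # Subtract this particular sensor's per-pixel deviation from an image.
    return image.astype(np.float64) - profile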

The alternative is to take the picture with different cameras. As the noise is only related to a specific piece of sensor, it will diminish with averaging.

Generalization

“Why is this piece of blog in the category Usability Research?”

“Because there are phenomena regarding noise in usability evaluations. And we can learn from drawing the analogy.”

Up to this point we can distinguish the following terms:

  • The measured object is what we want to represent as exactly as possible
  • We use a certain instrument, …
  • … which produces instrument specific noise
  • Additionally we have to fight totally random noise.
  • We have the ability to make multiple independent measures with the same instrument …
  • … and multiple measures with different instruments

Multiple measures with the same instrument will remove totally random noise when the same object is measured, or will identify the instrument-specific noise when different objects are measured.
Or we can measure the same object with different instruments to remove instrument-related deviations.

How does this apply to usability evaluation methods? Next time!

Fat(t)al Tonemapping: Effects of Image Scaling

For the tonemapping of HDR landscape panoramas I have so far achieved the best results with the Fattal02 operator from the PFStmo tool collection. However, an unexpected effect occurs which makes it difficult to predict the result:

The result depends strongly on the resolution of the image used.

To be able to judge the effect of the tonemapping parameters more quickly, I initially worked with a scaling factor of 0.2. On the command line this looks as follows:

pfsinexr Pano1-Pano3.exr| pfssize --ratio 0.2 --filter MITCHELL | pfsgamma -g 0.7 |pfstmo_fattal02 -v -a 0.05 -b 0.8 -s 0.8 |pfsout Pano1-Pano3_ldr1.tif

Fattal_005_08_08_scale02.jpg

In comparison, the result with a scaling factor of 0.8 looks distinctly different:
pfsinexr Pano1-Pano3.exr| pfssize --ratio 0.8 --filter MITCHELL | pfsgamma -g 0.7 |pfstmo_fattal02 -v -a 0.05 -b 0.8 -s 0.8 |pfsout Pano1-Pano3_ldr1.tif

Fattal_005_08_08_scale08.jpg

The result becomes downright bizarre if you omit scaling from the HDR processing chain altogether (apart from the fact that my machine then runs at full load for several minutes). Note the moiré-like artifacts in the centre.
pfsinexr Pano1-Pano3.exr| pfsgamma -g 0.7 |pfstmo_fattal02 -v -a 0.05 -b 0.8 -s 0.8 |pfsout Pano1-Pano3_ldr2.tif

Fattal_005_08_08_noscale

Conclusion: for me, the Fattal algorithm still delivers the best results when tonemapping high-contrast landscape shots. The results, however, are very hard to predict. In particular, you cannot rely on the appearance of a downscaled preview.

Whether the effect originates in the resizer of the PFStools command chain or only arises during tonemapping is currently unclear and will be examined in further experiments.

Defects, problems and uncertainty in falsification testing

Usability inspection methods are widely used to predict usage problems with a product. The outcome is usually a list of identified usability defects in the design of the product. Accordingly, as Woolrych et al. (2004) point out, a defect identification is only valid if evidence for a usage problem can be found. They suggest falsification user testing to achieve this. FUT is done subsequently to the assessment study, and the testing tasks are tightly focussed on the identified defects. Thus, a test person is very likely to stumble on a defect, provided it exists.

Of course, there exists a certain amount of uncertainty. Defect identification and usage problem identification are stochastic processes (remember the curve of diminishing returns), and thus a true problem might not be observed with your finite set of test persons.

What I want to remark here are two things:

  1. Terminology:
    • A user usually has a usage goal, and thus (s)he can also have a usage problem.
    • On the other hand, products have attributes, like usability, and they can have defects which lower that attribute, hence the term usability defect.
    • A UI designer has the goal of designing for usability. Hence, (s)he can have either a design problem (more general, as it also takes other -bilities into account) or a usability problem.
  2. The aforementioned uncertainty is not too relevant. The severity of a usability defect should be a function of the frequency and the impact of the resulting usage problems (Rubin, 1994). Hence, if a usage problem is not observed in a reasonable sample, it won’t have a high frequency and likely not a strong impact either. This uncertainty should therefore appear predominantly with minor defects (see the toy sketch below).
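
To make point 2 concrete, here is a toy sketch of severity as a function of frequency and impact. The scales and the multiplicative combination are my own illustrative assumptions, not Rubin’s exact scheme.

def severity(frequency, impact):
    # Toy severity score: frequency (0..1, share of users affected) times
    # impact (1..4, e.g. 1 = cosmetic, 4 = task failure). Illustrative only.
    return frequency * impact

# A problem that is never observed in a reasonable sample has an estimated
# frequency near zero, so its severity estimate stays low as well.
print(severity(0.0, 4))   # 0.0
print(severity(0.6, 3))   # 1.8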

Bibliography
Woolrych, A.; Cockton, G. & Hindmarch, M.: Falsification Testing for Usability Inspection Method Assessment. Proceedings of HCI 2004, 2004.

Rubin, J.: Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests. John Wiley & Sons, 1994.