PQA500 Predicted DMOS Measurement Checklist
This page was created to help verify best practices in using the Tektronix PQA500 to predict picture quality mean opinion scores relative to reference scores (the difference of mean opinion scores, otherwise known as DMOS). Many of the best practices mentioned in this document also apply to other picture quality measurements, virtual or real.
Appropriate Use Of DMOS
Why The Checklist?
Predicted DMOS is the most direct way to quantify the differences seen between processed and reference video as relative mean opinion scores, short of actually conducting an experiment that gathers data from at least two dozen individuals as per the ITU-R BT.500 standard (for SD video, or the equivalent for HD, CIF, etc.). The resulting measurement gives numeric values on a 100 point scale corresponding to the text/adjective scale used by individuals to rate the reference and test video clips: the mean score of the reference video is subtracted from the respective mean score of the test video clip, and a factor of 20 converts the nominal 5 point scale to the 100 point scale.
The PQA500 conducts a virtual BT.500 study with virtual displays, a virtual viewing environment, and virtual humans, including very detailed macro-behavioral models of perceptual sensitivity and, optionally, the attention and artifact-classification aspects, along with the requisite summary judgment component, of the cognitive systems used for human vision and comprehension. Just as DMOS results from real BT.500 studies can differ significantly between labs whose displays, viewing conditions, and demographic sampling differ significantly, so too can the results of the virtual BT.500 study. The checklist given below is therefore important for verifying a match between the virtual study and the desired real DMOS study.
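For illustration, here is a minimal Python sketch of the scale arithmetic just described. The function name and the small rating sample are hypothetical, and this is not the PQA500's internal code; a real BT.500 study would use at least two dozen raters.

    # Sketch of the DMOS scale arithmetic (illustrative only).
    def dmos(test_ratings, ref_ratings):
        """Difference of mean opinion scores on a 100 point scale.

        Ratings use the nominal 5 point adjective scale
        (5=excellent, 4=good, 3=fair, 2=poor, 1=bad).
        """
        mos_test = sum(test_ratings) / len(test_ratings)
        mos_ref = sum(ref_ratings) / len(ref_ratings)
        # Subtract the reference mean from the test mean, then scale the
        # nominal 5 point range to the 100 point range (factor of 20).
        return 20.0 * (mos_ref - mos_test)

    # Example: test clip rated ~3.6, reference rated ~4.8 by a small panel.
    print(dmos([4, 3, 4, 4, 3], [5, 5, 4, 5, 5]))  # -> 24.0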
Use of Predicted DMOS vs PQR Measurement
Note: If it is desired to determine the somewhat binary result of whether or not a viewing audience can see any difference between the reference video clip and the processed video clip, PQR is the more appropriate measurement. PQR units are calibrated so that a value of 1 in the difference map means the difference between the reference and test video clips is just noticeable (1 JND). The PQR result per field or frame and the overall result for the sequence use a Minkowski metric to give an overall impression of the difference. For very meticulous checking of any, even minute, visible differences, the maximum PQR over each field/frame and over the overall sequence is also given in the results.
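The Minkowski pooling idea can be sketched as follows. The exponent p below is an assumed placeholder, not the value the PQA500 actually uses, and the function is purely illustrative:

    # Minimal sketch of Minkowski pooling of per-pixel JND differences into
    # a single PQR-style summary number (exponent p is a placeholder).
    def minkowski_pool(jnd_values, p=4.0):
        n = len(jnd_values)
        return (sum(abs(v) ** p for v in jnd_values) / n) ** (1.0 / p)

    frame_map = [0.2, 0.1, 3.0, 0.4]       # per-pixel JND differences
    frame_pqr = minkowski_pool(frame_map)  # per-frame summary (~2.12)
    max_pqr = max(frame_map)               # worst single difference (3.0)
    print(frame_pqr, max_pqr)

As p increases, the pooled value approaches the maximum, which is one way to see why the maximum PQR is also reported for meticulous checking of minute differences.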
However, in applications where visible differences are not unexpected, and where quantification of these (perhaps very subtle, perhaps not so subtle) differences is needed, predicting DMOS is more appropriate.
The DMOS Prediction Checklist
For details of the control parameters mentioned here, see the "PQA500 Picture Quality Analyzer Technical Reference" document.
- File and video formats & TFF vs BFF:
The "PQA500 Picture Quality Analyser Technical Reference" document gives information regarding supported files and conversion to supported files from other types of files. A related issue for interlaced video is the "top field first" vs "bottom field first" which differ between PAL and NTSC but for video generator and video capture cards can swap these. Make sure that this has not happened with your video. There are a few ways to check for this: for example, using the measurement config temporal alignment tab, using the difference window (and successively cropping the top few lines of either the test or the reference video to reveal the top lines of the other, especially if vertical scale is maximized for both); another way is to make a measurement and use the cross-fade utility in the overlay view mode when reviewing results.
- Temporal Alignment:
Insufficient temporal and/or spatial alignment may be one of the most common reasons for poor DMOS prediction results, generally giving worse (larger) scores than expected. A few known conditions make proper alignment ambiguous. If the video continuously pans or zooms, there is generally an ambiguity between temporal and spatial alignment: with a horizontal pan, cropping of one side of the reference relative to the test combined with a spatial shift is possible, but there is no way to determine how much of each. Likewise for a zoom, there could be cropping (blanking) all around one clip relative to the other with an associated scale change, or not. The ambiguity is mitigated only by modulation in the pan or zoom, especially if it slows or stops momentarily. Another source of ambiguity is that video codecs often lag the motion of objects, so spatial (mis-)alignment can vary quite a bit across individual frames, especially when there is motion, including pan and zoom. Thus, if possible, try to include a portion of the video without pan or zoom, or at least a portion where motion delay due to codec distortions is minimized. If auto temporal alignment fails with a warning about low correlation and search range, a different start and end point can be selected and the measurement repeated. See the spatial alignment section for tips on search range settings; for temporal alignment, the point about the proportion missing applies equally, only the units are frames instead of pixels.
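To see why low correlation triggers the warning, consider this illustrative sketch of coarse temporal alignment by correlating simple per-frame signatures (e.g., mean luma per frame) over a bounded offset range. This is a stand-in under simple assumptions, not the PQA500's algorithm:

    # Illustrative coarse temporal alignment via per-frame signatures.
    import numpy as np

    def best_frame_offset(ref_sig, test_sig, max_offset=30):
        """ref_sig/test_sig: 1-D numpy arrays of per-frame signatures."""
        best, best_score = 0, -np.inf
        for off in range(-max_offset, max_offset + 1):
            lo, hi = max(0, off), min(len(test_sig), len(ref_sig) + off)
            if hi - lo < 10:          # require a minimum overlap
                continue
            r = ref_sig[lo - off:hi - off]
            t = test_sig[lo:hi]
            score = np.corrcoef(r, t)[0, 1]
            if score > best_score:
                best, best_score = off, score
        return best, best_score      # low best_score => ambiguous alignment

A continuous pan or zoom makes many offsets score almost equally well, which is exactly the ambiguity described above; a momentary pause in the motion produces a distinct correlation peak.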
- Spatial Alignment:
- Selecting the best frames to align: As mentioned above regarding temporal alignment, video codecs can delay the motion of individual objects, so perfect alignment between the respective reference and test frames is not always possible. To mitigate this, it is generally best practice to spatially align frames known not to have such distortions: typically frames with minimum motion, minimum spatial distortions (blur, mosquito noise, etc.), and maximum undistorted detail. When using the auto spatial alignment feature, the first frames of each sequence are used for spatial alignment. If these first frames are not ideal relative to the criteria just given, and a better (temporally aligned) frame pair exists within the clip pair, make the spatial alignment measurement there and use the results to set a fixed, manually controlled spatial alignment for best accuracy.
- Search range, accuracy vs speed: If auto spatial alignment fails with a warning about low correlation and search range, again, a different frame pair can be selected and the measurement repeated. Note that the search range is not a scale or offset range, but rather the maximum percentage of the reference that is missing in the test, or vice versa. For example, if the reference is a 16x9 HD image and the test is a 4x3 SD image with the sides cropped to fit, the percentage of the horizontal extent missing in the test corresponds to the minimum search range required for alignment to work. If there is a one-to-one correspondence between test and reference (nothing missing in one relative to the other), then the search range can be set to the minimum value and the measurement will be fastest and most accurate.
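The 16x9-to-4x3 example works out as follows (simple arithmetic, shown here as a Python sketch):

    # Worked example of the "percent missing" search range: a 16:9 HD
    # reference side-cropped to fill a 4:3 SD test image.
    ref_aspect = 16 / 9
    test_aspect = 4 / 3
    fraction_kept = test_aspect / ref_aspect      # 0.75 of the width survives
    percent_missing = 100 * (1 - fraction_kept)   # 25% of the width is cropped
    print(percent_missing)  # horizontal search range must be at least ~25%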
- Display Models:
The display model is a key component of the virtual DMOS study used to predict DMOS. Just as we can't view video without a display, the virtual humans in the simulation cannot watch video without a virtual display. And just as with real displays, virtual displays convert the incoming digital video to light in different ways depending on the display technology. These differences affect how the video looks. The most "transparent" looking display is the progressive CRT. Though newer technologies are improving and becoming more "transparent," as of June 2008 only very high-end (somewhat esoteric) LCD displays come close, mostly by mimicking the progressive scan of CRTs. DMDs use temporal modulation of duty cycle, which can result in artifacts as well.
- For the most "transparent" display, use progressive CRT display models: The smaller the resolution, the more important this is. In other words, for SD, a progressive display is noticeably more transparent than an interlaced display (interlaced displays have "flicker" and related visible artifacts), whereas for HD displays, the interlacing artifacts are generally much less noticeable.
- Using the display your audience is using is best for accurate simulation: So in some cases you may want to use an interlaced CRT for SD, or an LCD with its motion blur (now in a growing number of homes), etc. Displays can be compared directly by running DMOS measurements using, for example, a progressive CRT reference display and an LCD test display. Typical differences for moderate to fairly high motion video are between 10 and 25 points on the 100 point scale. This corresponds to a study conducted by Prof. Patrick Le Callet at the University of Nantes, France, which measured the difference in MOS of video clips of various motion content viewed on CRT vs LCD displays.
- Worst case pitfalls of interlaced display simulation: Two low resolution or closely viewed bright interlaced displays showing bright video at different frame rates and/or with differing vertical offset and scale (compensated for using spatial alignment) are the worst case scenario for the least transparent display. In such cases, video can actually look very different between the two channels (reference vs test display, etc.). If you don't want these differences considered in making picture quality assessments, it is best to use a more transparent display (a progressive CRT, for example).
- Removing the display model:
It is sometimes asked why the display model can't simply be removed so as to be maximally transparent. The problem is that the human vision model needs light as input, and the conversion from digital video to light via a display is not linear (not even approximately "transparent" in the mathematical sense). Ideally it includes a gamma power law conversion, which the display model provides. Brightness and contrast adjustments, which depend on the ambient light and other factors in the viewing environment, are also often required; these too are provided in the simulation by the display model (whereas the view model provides the virtual viewing environment). Note that ITU-R BT.500 (and the respective standards for other video formats) specifies the display and viewing environment precisely because of their critical importance to PQ ratings. The PQA500 does allow removal of the display model for cases where the Tektronix proprietary .lgt (light) file format is input. This is useful for applications such as display design verification, comparing perceptual differences between virtual reality generation and real photography/film/video, etc.
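As an illustration of the digital-to-light conversion a display model performs, here is a minimal sketch; the gamma, peak white, brightness, and contrast values are assumed for the example, not taken from any PQA500 display model:

    # Minimal sketch of a display's digital-code-to-light conversion
    # (all parameter values are illustrative assumptions).
    def code_to_luminance(code, white_nits=100.0, gamma=2.4,
                          contrast=1.0, brightness=0.0):
        """Map an 8-bit code value to light output (nits) via a gamma power law."""
        v = contrast * (code / 255.0) + brightness   # normalized drive level
        v = min(max(v, 0.0), 1.0)
        return white_nits * (v ** gamma)

    print(code_to_luminance(128))  # mid-gray code -> ~19 nits on a 100 nit display

Note that the mid-gray code value produces far less than half the peak light, which is one reason treating code values as if they were light would mislead the vision model.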
- Use a display model that matches your monitor, along with a viewing environment that matches yours, if you want to predict what you are seeing.
- View Model
- Spatial Alignment: See the spatial alignment section above.
- Ambient light: The lighting where the video is viewed is important because very bright environments can make it difficult to see details in the darker portions of the video. Up to about 2 nits is common for office lighting on CRT and LCD computer displays. The BT.500 ambient level is specified in the standard and used for the first few associated preset DMOS measurement configurations (see the respective measurement descriptions in the measurement configuration window, or the Technical Reference, for BT.500 or other standard viewing conditions).
- Viewing distance: Besides the obvious fact that detail cannot be seen at great distances, at greater viewing distances some low spatial frequency differences (large area differences) can become more noticeable. Make sure the viewing distance chosen matches what you want: a standard such as BT.500 for DMOS, or, if you want the prediction to match your own opinion, the viewing distance you actually use.
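As a back-of-the-envelope illustration of why viewing distance matters, the sketch below converts a detail size into the spatial frequency it subtends at the eye; all numbers are assumed for the example:

    # Illustrative conversion from viewing distance (in picture heights) to
    # the retinal spatial frequency of a detail of a given size in lines.
    import math

    def cycles_per_degree(lines_per_cycle, picture_height_lines,
                          viewing_distance_heights):
        # Fraction of picture height occupied by one cycle of the detail.
        cycle_fraction = lines_per_cycle / picture_height_lines
        # Angle subtended by that cycle, in degrees.
        angle = math.degrees(math.atan(cycle_fraction / viewing_distance_heights))
        return 1.0 / angle

    # The same 2-line detail in 1080-line video at 3H vs 8H viewing distance:
    print(cycles_per_degree(2, 1080, 3))  # ~28 cpd, near the limit of acuity
    print(cycles_per_degree(2, 1080, 8))  # ~75 cpd, invisible to most viewers

The same pixel-level difference can thus be clearly visible at one distance and completely invisible at another, which is why the viewing distance setting must match the study you intend to predict.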
- Perceptual Model:
Generally no special attention is needed for our checklist here. However, for predicting DMOS for specific demographics (older vs younger people, experts vs typical users, you or your client, etc.), either a preset selection (typical viewer or expert viewer) or a custom selection can be made. For DMOS as per ITU-R BT.500, representing the population in general, "typical" is most appropriate. See the Technical Reference for details on custom-configuring the human vision model to match a particular demographic. It may be worth noting that the interlacing artifacts often associated with large perceptual differences at different frame rates, vertical offsets, etc. can be mitigated by reducing the flicker sensitivity using the speed and luminance sensitivity controls of the temporal response of the human vision model. Some people don't see the flicker of white on CRTs even at 25 frames per second with a bright display at the most sensitive viewing distance. However, care must be taken not to make the temporal response so slow that temporal detail we can all see is lost. This is why changing these settings is generally not recommended.
- Attention Model:
Though this is a processor-hungry model, it can improve DMOS prediction significantly, depending on how much of the video fills a viewer's field of view and the nature of the video. For example, for small format video on mobile devices held about 10 screen heights away from the viewer, the attention model will generally not make much difference in DMOS prediction accuracy. However, for close viewing of HD or digital cinema, where there are many different places to look and you can't see everything at once in detail, the attention model is very useful for predicting DMOS. Again, DMOS is good for predicting viewer opinion of quality; if you are interested in catching any visible artifact, PQR with no attention model is a much more appropriate measurement. The attention model emphasizes the areas where people are most likely to look, including artifacts if they are bad enough to be distractions. This is a key element that makes the attention model useful: studies (such as those by Prof. Le Callet, U. Nantes, and K. Ferguson, Tektronix) have shown no improvement in DMOS prediction accuracy without the ability to predict distractions due to artifacts.
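The core idea can be sketched as attention-weighted pooling; the saliency values and region granularity below are hypothetical, not the PQA500's internal representation:

    # Sketch of attention-weighted pooling: perceptual difference values are
    # weighted by a (hypothetical) saliency map before summary, so errors
    # where viewers actually look count for more.
    def attention_weighted_score(diff_map, saliency_map):
        """diff_map, saliency_map: equal-length lists of per-region values."""
        total_w = sum(saliency_map)
        return sum(d * w for d, w in zip(diff_map, saliency_map)) / total_w

    diff = [0.5, 0.5, 4.0]      # big artifact in the third region...
    salience = [0.2, 0.2, 0.6]  # ...which also draws attention (a distraction)
    print(attention_weighted_score(diff, salience))  # 2.6 vs unweighted ~1.67

Note how a distracting artifact raises its own weight in this toy example, which mirrors the finding cited above that modeling distractions is what makes attention useful for DMOS prediction.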
- Attention model tuned to the application: Though there are some preset measurements (ADMOS) that include attention-model-weighted DMOS for different applications (general video, talking head, sports), depending on the type of video, different aspects of the video may become more important in how they draw attention. The attention model allows custom configurations for these; see the Technical Reference for details. This is an area to check if you have a very specialized video application, such as certain types of video surveillance.
- Artifact Detection:
This is mainly useful if it is known that a certain type of artifact will be more annoying than another, which tends to be application dependent. For talking heads and video where close-ups are used, a little softening (blurring) may not be nearly as objectionable as mosquito noise, jaggies, and other added-edge artifacts. However, for a football or soccer game, the same blurring may cause the ball to disappear and so may be considered much more objectionable than a little mosquito noise and jaggies. In many applications, all artifacts are about equally objectionable. If all differences are considered equally bad, don't use this node in the measurement. However, if some are considered worse than others for a particular application, use this measurement node to weight the respective artifacts accordingly (see the sketch after this section).
Also note that, just as with the attention model, this is usually less important for small-sized video and more important when you can't see all of the video at once.
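A minimal sketch of application-dependent artifact weighting follows; the artifact class names, weights, and severity values are invented for the example and are not PQA500 parameters:

    # Sketch of per-artifact weighting: each detected artifact class gets an
    # application-specific weight (all names and numbers are illustrative).
    sports_weights = {"blur": 3.0, "mosquito_noise": 1.0, "jaggies": 1.0}
    talking_head_weights = {"blur": 0.5, "mosquito_noise": 2.0, "jaggies": 2.0}

    def weighted_artifact_score(artifact_levels, weights):
        """artifact_levels: measured severity per artifact class."""
        return sum(weights[k] * v for k, v in artifact_levels.items())

    levels = {"blur": 1.2, "mosquito_noise": 0.4, "jaggies": 0.3}
    print(weighted_artifact_score(levels, sports_weights))        # 4.3, blur dominates
    print(weighted_artifact_score(levels, talking_head_weights))  # 2.0, blur forgiven

The same measured distortions score very differently under the two weightings, which is the point of tuning this node per application.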
- Summary:
- Very Important! Setting the scale: The most important part of the checklist is to include in the virtual DMOS study the "worst case training" specified in ITU-R BT.500. In the standard, viewers are shown an example of how bad the video will get during the viewing session before they start rating. This "trains" them in how to use the scale, ensuring everyone uses the same terminology, since one person's "good" might be another person's "fair." Whatever DMOS measurement you want to make, first make it with the worst case training, then edit the measurement and select that result as the worst case training in the summary node (see the Technical Reference for details on editing measurements).
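The effect of worst case training on the scale can be illustrated with the toy arithmetic below. The anchoring formula and numbers are assumptions for illustration only, not the PQA500's actual summary computation (see the Technical Reference for that):

    # Toy illustration of what "setting the scale" accomplishes: the worst
    # case training result anchors the rating scale, so every other score
    # is interpreted relative to it.
    def rescale_to_training(raw_score, worst_case_raw, scale_max=100.0):
        """Map a raw difference score onto a scale anchored by the worst case."""
        return scale_max * raw_score / worst_case_raw

    worst_case = 42.0                             # raw score of the training clip
    print(rescale_to_training(10.5, worst_case))  # -> 25.0 on the anchored scale

Without the anchor, two measurements of the same clip made against different "worst cases" would land at different points on the scale, just as two real viewer panels trained differently would.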
Copyright © 2008 Kevin Ferguson/debone.com. All Rights Reserved.
Any redistribution of information found at this site is prohibited.