Notes by ff123
January 22, 2002: Added subanalysis for listeners with highly correlated preferences
These are the results of the second series of group blind tests I ran for codecs that average about 128 kbit/s. The test was open for a little over two months, and between 25 and 28 listeners participated, depending on the sample. Now that the test is complete, the individual ratings and comments are available as follows:
Listener comments for fossiles.wav
Listener comments for rawhide.wav
Listener comments for wayitis.wav
To test for significant differences between the mean ratings I am using resampling, with a resampling stepdown to adjust the p-values for multiple comparisons. Garf and I wrote the C code to perform this analysis, which I have made available here. Compared with a traditional method such as ANOVA using Fisher's Least Significant Difference with no adjustment for multiplicity, this method has two advantages: it assures strong control of the experiment-wise error rate, and it does not assume that the listeners' ratings are normally distributed. It thus makes for more robust conclusions.
The resampling model assumes that each listener's ratings are correlated with each other. In classical terms, I am performing a "blocked" analysis, with each listener representing a separate block. This increases the power of the analysis.
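As a rough illustration of the blocked idea, here is a minimal Python sketch of an unadjusted p-value for a single pair of codecs. It is not the C program mentioned above: the function name, the data layout (one row of ratings per listener), and the use of the absolute difference in mean ratings as the test statistic are all assumptions made for the example.

```python
import random

def blocked_resample_pvalue(ratings, a, b, trials=10000, seed=0):
    """Unadjusted p-value for the codecs in columns a and b.

    ratings: one row per listener (block), one rating per codec.
    Under the null hypothesis the ratings within a row are treated as
    exchangeable, so each listener's resampled values are drawn, with
    replacement, from that listener's own pool of ratings.
    """
    rng = random.Random(seed)
    n = len(ratings)
    # Observed statistic: absolute difference between the two codec means.
    observed = abs(sum(row[a] - row[b] for row in ratings)) / n
    hits = 0
    for _ in range(trials):
        diff = 0.0
        for row in ratings:
            # Resample within the block only, never across listeners.
            diff += rng.choice(row) - rng.choice(row)
        # Small tolerance guards against floating-point ties.
        if abs(diff) / n >= observed - 1e-12:
            hits += 1
    return hits / trials

# Hypothetical ratings for three listeners and six codecs.
example = [
    [5.0, 4.5, 4.0, 4.0, 3.0, 2.5],
    [4.0, 4.0, 3.5, 4.5, 3.0, 3.0],
    [5.0, 5.0, 5.0, 4.0, 4.0, 3.5],
]
print(blocked_resample_pvalue(example, 0, 5))
```

Because the resampled values never cross from one listener to another, a listener who rates everything high or everything low does not inflate the variability of the differences, which is where the extra power of a blocked analysis comes from.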
Note: the resampling stepdown method used to adjust the p-values is called a "free" stepdown method. There is also a "restricted" stepdown method which is potentially even more powerful, but it has not yet been implemented in the code used here.
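A step-down adjustment of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the C code used for the results: it uses the maximum-statistic form of the step-down, and the function name and input layout (a list of observed statistics plus one list of null statistics per resampling trial) are hypothetical.

```python
def free_stepdown(observed, resampled):
    """Step-down adjusted p-values based on successive maxima.

    observed:  list of k observed test statistics (larger = more extreme).
    resampled: list of trials; each trial is a list of k statistics
               computed under the null hypothesis from one resample.
    Returns the adjusted p-values in the same order as `observed`.
    """
    k = len(observed)
    # Hypothesis indices ordered from most to least extreme.
    order = sorted(range(k), key=lambda i: observed[i], reverse=True)
    counts = [0] * k
    for stats in resampled:
        running_max = float("-inf")
        # Walk from the least extreme hypothesis upward, so that rank
        # `pos` is compared against the maximum over ranks pos..k-1.
        for pos in range(k - 1, -1, -1):
            running_max = max(running_max, stats[order[pos]])
            if running_max >= observed[order[pos]]:
                counts[pos] += 1
    adjusted = [0.0] * k
    previous = 0.0
    for pos, i in enumerate(order):
        # Enforce monotonicity: a less extreme comparison never gets a
        # smaller adjusted p-value than a more extreme one.
        previous = max(previous, counts[pos] / len(resampled))
        adjusted[i] = previous
    return adjusted

# Tiny hand-checkable example with two comparisons and four trials.
print(free_stepdown([2.0, 1.0], [[0, 1.5], [0, 1.2], [2.5, 0], [0, 0]]))
```

The most extreme comparison is judged against the maximum over all comparisons, the next one against the maximum over the remaining ones, and so on, which is why the adjustment is less severe than a single-step correction applied uniformly to every comparison.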
Comments on the results: I didn't want to use extremely difficult samples, which wouldn't be representative of most normal music. However, I erred on the side of choosing samples that were too easy for the encoders at 128 kbit/s. Fossiles.wav proved to lack any discriminating power. Rawhide.wav had significant results at one point during the test, but these subsequently faded into the noise as further listeners submitted results. Wayitis.wav had adequate discriminating power; however, the 28th (and last) listener submitted results that deviated significantly from the rest of the group, causing three comparisons to lose significance. I verified to my satisfaction that this listener was not a troll, that he performed the test properly, and that he was truly able to hear differences from the original (he submitted highly significant ABX results). I cannot discard this data, but the wildly discrepant preferences puzzle me.
The next test I organize will hopefully use a tool better suited to post-screening, such that results from listeners who consistently rate the original worse than encoded files will be discarded. For the current test, I am bound to keep all results as being equal in worth, even though I suspect that not to be the case.
Resampling analysis, fossiles.wav:
25 listeners
Each listener's resampled values are chosen from his own pool of ratings
(blocked analysis).
Blocked model used to calculate unadjusted p-values.
100,000 bootstrap trials are performed to adjust p-values using free step-down.
Means:
wma8 lame aac mpc ogg xing
4.58 4.55 4.51 4.50 4.36 4.30
Unadjusted p-values
lame aac mpc ogg xing
wma8 0.846 0.646 0.628 0.177 0.083
lame - 0.790 0.771 0.247 0.123
aac - - 0.981 0.371 0.201
mpc - - - 0.384 0.210
ogg - - - - 0.699
Adjusted p-values
lame aac mpc ogg xing
wma8 0.991 0.987 0.987 0.738 0.507
lame - 0.991 0.991 0.805 0.626
aac - - 0.991 0.917 0.770
mpc - - - 0.915 0.771
ogg - - - - 0.991
Comparisons marked with * below are true as a group with 95% confidence (adjusted p < 0.05). None reach significance for this sample.
Resampling analysis, rawhide.wav:
26 listeners
Each listener's resampled values are chosen from his own pool of ratings
(blocked analysis).
Blocked model used to calculate unadjusted p-values.
100,000 bootstrap trials are performed to adjust p-values using free step-down.
Means:
aac ogg wma8 lame mpc xing
4.54 4.54 4.47 4.31 4.28 4.22
Unadjusted p-values
ogg wma8 lame mpc xing
aac 0.979 0.611 0.104 0.066 0.024*
ogg - 0.630 0.110 0.070 0.026*
wma8 - - 0.261 0.182 0.079
lame - - - 0.830 0.520
mpc - - - - 0.668
Adjusted p-values
ogg wma8 lame mpc xing
aac 0.984 0.978 0.505 0.413 0.206
ogg - 0.978 0.505 0.417 0.209
wma8 - - 0.781 0.665 0.435
lame - - - 0.978 0.959
mpc - - - - 0.978
Comparisons marked with * below are true as a group with 95% confidence (adjusted p < 0.05); the others are significant only by unadjusted p-value.
aac is better than xing
ogg is better than xing
Resampling analysis, wayitis.wav:
28 listeners
Each listener's resampled values are chosen from his own pool of ratings
(blocked analysis).
Blocked model used to calculate unadjusted p-values.
100,000 bootstrap trials are performed to adjust p-values using free step-down.
Means:
ogg mpc aac lame xing wma8
4.34 4.33 4.15 4.02 3.59 3.52
Unadjusted p-values
mpc aac lame xing wma8
ogg 0.973 0.379 0.145 0.001* 0.000*
mpc - 0.397 0.154 0.001* 0.000*
aac - - 0.561 0.011* 0.004*
lame - - - 0.048* 0.021*
xing - - - - 0.740
Adjusted p-values
mpc aac lame xing wma8
ogg 0.974 0.848 0.539 0.007* 0.002*
mpc - 0.840 0.534 0.008* 0.003*
aac - - 0.915 0.081 0.036*
lame - - - 0.251 0.137
xing - - - - 0.932
Comparisons marked with * below are true as a group with 95% confidence (adjusted p < 0.05); the others are significant only by unadjusted p-value.
ogg is better than wma8 *
mpc is better than wma8 *
ogg is better than xing *
mpc is better than xing *
aac is better than wma8 *
aac is better than xing
lame is better than wma8
lame is better than xing
January 22, 2002
Rich Ulrich, biostatistician at the University of Pittsburgh, provided the following comments concerning the wayitis.wav sample.
Okay, I just looked at wayitis, where there are 6 ratings per person, for 28 persons, of which 24 are useful; 4 show no variation (rated 5 for all).
I copied the data; I transposed it to give me 6 lines, one per codec, with 24 Raters. (SPSS Flip procedure.)
I used the Reliability procedure of SPSS, which includes a 'corrected item-total correlation' -- the correlation of each Rater (in this case) with the average for all the *other* raters. I was surprised to see that 6 of the 24 were negative. Here are the values, decimals omitted: 96, 95, 92, 86, 86, 86, 84, 82, 80, 64, 58, 56, 56, 46, 38, 37, 35, 35, 31, and negative: 23, 47, 50, 56, 71, 81.
The raters with -.81 and -.71 were subjects whose scores started (4.5,3.5,1.0 -- last in the list) and (5.0,5.0,4.0 -- #13 out of 24).
With an N=6, I don't know if the correlations below 70 represent random variation of poor reliability, or what. But the higher ones ought to represent consistent information, even if reported in the wrong direction.
So, I see 2 cases where something went badly amiss, and 4 with negative r, and 5 or 10 with small r. I looked closer at the negatives. Their average scores are patterned just the reverse of the positive scores; there seems to be no new information.
If these were my data: I would want to figure out what could have gone wrong, especially with the 2: faulty randomization? spoofing?
Any complete report will describe the inconsistencies. Some raters show the reverse? - because of odd equipment? odd subjective standards?
Given the lack of controls in conditions and equipment, I think a sub-analysis is justified, based on the 9 'consistent' raters alone.
I verified Rich's numbers using my poor man's version of SPSS, Microsoft Excel (I didn't check whether it could be done with R). The correlation coefficients for each rater are as follows. The nine consistent listeners are marked with an asterisk. Note: all 28 responses are valid data. This subanalysis attempts to isolate the most common preference. Having a minority and divergent preference does not make the data invalid!

Corrected Item-Total Correlations

| Listener | r | Listener | r |
|---|---|---|---|
| 1* | 0.86 | 17 | -0.72 |
| 2* | 0.95 | 18* | 0.82 |
| 3 | 0.31 | 19* | 0.96 |
| 6* | 0.81 | 20 | -0.49 |
| 7 | -0.56 | 21 | 0.46 |
| 8 | 0.64 | 22 | -0.48 |
| 9 | 0.58 | 23* | 0.86 |
| 10* | 0.86 | 24 | -0.23 |
| 11 | 0.56 | 25 | 0.37 |
| 13 | 0.38 | 26 | 0.35 |
| 14* | 0.84 | 27* | 0.92 |
| 16 | 0.35 | 28 | -0.81 |
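For readers without SPSS or Excel, the corrected item-total correlation described above (each rater correlated with the average of all the *other* raters) can be sketched in plain Python. This is an illustrative sketch, not the procedure actually run: the function names and the rater data are hypothetical, and raters with no variation are dropped first because their correlation is undefined.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation, or None when either series has no variation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def corrected_item_total(ratings):
    """Correlate each rater with the mean of all the other raters.

    ratings: dict mapping a rater label to that rater's list of
    per-codec ratings (same codec order for everyone).
    """
    # Drop raters who gave every codec the same rating.
    useful = {name: r for name, r in ratings.items() if len(set(r)) > 1}
    result = {}
    for name, own in useful.items():
        others = [r for other, r in useful.items() if other != name]
        # Per-codec average over everyone except the current rater.
        rest_mean = [sum(col) / len(others) for col in zip(*others)]
        result[name] = pearson(own, rest_mean)
    return result

# Hypothetical ratings: C prefers the reverse order, D rated all 5s.
raters = {
    "A": [5, 4, 3, 2, 1, 1],
    "B": [5, 5, 3, 2, 2, 1],
    "C": [1, 2, 3, 4, 5, 5],
    "D": [5, 5, 5, 5, 5, 5],
    "E": [4, 5, 3, 3, 1, 2],
}
for name, r in corrected_item_total(raters).items():
    print(name, round(r, 2))
```

A rater whose preferences run opposite to the group, like the hypothetical rater C, gets a negative coefficient, which is the pattern seen for the six listeners with negative r in the table.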
When these nine listener responses are analyzed, the results are:
Resampling analysis (nine "consistent" listeners, wayitis.wav):
9 listeners
Each listener's resampled values are chosen from his own pool of ratings
(blocked analysis).
Blocked model used to calculate unadjusted p-values.
100,000 bootstrap trials are performed to adjust p-values using free step-down.
Means:
mpc ogg lame aac wma8 xing
4.63 4.09 3.61 3.36 2.11 2.04
Unadjusted p-values
ogg lame aac wma8 xing
mpc 0.022* 0.000* 0.000* 0.000* 0.000*
ogg - 0.043* 0.003* 0.000* 0.000*
lame - - 0.270 0.000* 0.000*
aac - - - 0.000* 0.000*
wma8 - - - - 0.772
Adjusted p-values
ogg lame aac wma8 xing
mpc 0.078 0.000* 0.000* 0.000* 0.000*
ogg - 0.116 0.011* 0.000* 0.000*
lame - - 0.467 0.000* 0.000*
aac - - - 0.000* 0.000*
wma8 - - - - 0.772
Comparisons marked with * below are true as a group with 95% confidence (adjusted p < 0.05); the others are significant only by unadjusted p-value.
mpc is better than xing *
ogg is better than xing *
lame is better than xing *
aac is better than xing *
mpc is better than wma8 *
ogg is better than wma8 *
lame is better than wma8 *
aac is better than wma8 *
mpc is better than aac *
ogg is better than aac *
mpc is better than lame *
ogg is better than lame
mpc is better than ogg
NOTE: If a tenth listener is added (the one with
a high negative correlation of -0.81) to this subanalysis, the
results essentially look like the overall results for the total
28 listeners. That is, one sensitive listener with highly
divergent preferences has a large effect.