Final Results of 128 kbit/s Tests

Notes by ff123

January 22, 2002: Added subanalysis for listeners with highly correlated preferences

These are the results of the second series of group blind tests I ran for codecs which average about 128 kbit/s. The test was open for a little over two months, and between 25 and 28 listeners participated, depending on the sample. Now that the test is complete, the individual ratings and comments are available as follows:

Listener comments for fossiles.wav
Listener comments for rawhide.wav
Listener comments for wayitis.wav

The statistical method I am using to test for significant differences in the means of the ratings is called resampling, and I am using a resampling stepdown to adjust the p-values. Garf and I wrote the C code to perform this analysis, which I have made available here. Traditional methods such as ANOVA followed by Fisher's Least Significant Difference make no adjustment for multiplicity. The resampling method, by contrast, assures strong control of experiment-wise error and does not assume that the ratings are normally distributed, and thus makes for more robust conclusions.

The resampling model assumes that each listener's ratings are correlated with each other. In classical terms, I am performing a "blocked" analysis, with each listener representing a separate block. This increases the power of the analysis.
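As a rough illustration of the blocked idea, the sketch below redraws each listener's ratings only from that listener's own pool and counts how often a resampled difference in means is at least as large as the observed one. It is a minimal sketch and not the C code mentioned above: the function names, the made-up ratings, and the use of a plain two-sided count for the p-value are all assumptions made for illustration.

```python
import random

def column_means(ratings):
    """Mean rating per codec; ratings is a list of per-listener rows."""
    n = len(ratings)
    return [sum(row[j] for row in ratings) / n for j in range(len(ratings[0]))]

def blocked_resample(ratings, rng):
    """Redraw each listener's row, with replacement, from that listener's own ratings."""
    return [[rng.choice(row) for _ in row] for row in ratings]

def resampled_p_value(ratings, a, b, trials=2000, seed=0):
    """Two-sided resampling p-value for the difference in means of codecs a and b."""
    rng = random.Random(seed)
    means = column_means(ratings)
    observed = abs(means[a] - means[b])
    hits = 0
    for _ in range(trials):
        m = column_means(blocked_resample(ratings, rng))
        # Small tolerance so exact ties count as "at least as extreme".
        if abs(m[a] - m[b]) >= observed - 1e-12:
            hits += 1
    return hits / trials

# Made-up ratings (hypothetical): 4 listeners (rows) x 3 codecs (columns).
example = [
    [5.0, 4.0, 2.0],
    [4.5, 4.5, 3.0],
    [5.0, 3.5, 2.5],
    [4.0, 4.0, 1.5],
]
print(resampled_p_value(example, 0, 2))
```

Because a listener who rates everything high only ever contributes high values to his own resampled row, differences between listeners in how they use the scale do not inflate the noise, which is where the extra power of blocking comes from.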

Note: the resampling stepdown method used to adjust the p-values is called a "free" stepdown method. There is a "restricted" stepdown method which is potentially even more powerful, but it has not yet been implemented into the code used here.
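The step-down adjustment can be sketched in a similar way. The version below is a simplified max-statistic variant: it uses the absolute difference in means as the test statistic for each pair rather than adjusting p-values directly, so it should be read as an illustration of the idea and not as the analysis that produced the tables on this page. It is "free" in the sense that every resample is drawn under the complete null, without using logical constraints among the hypotheses. The function names and the sample data are hypothetical.

```python
import random
from itertools import combinations

def column_means(ratings):
    """Mean rating per codec; ratings is a list of per-listener rows."""
    n = len(ratings)
    return [sum(row[j] for row in ratings) / n for j in range(len(ratings[0]))]

def free_stepdown_adjust(ratings, trials=2000, seed=0):
    """Return {(a, b): adjusted p-value} for every pair of codecs."""
    rng = random.Random(seed)
    pairs = list(combinations(range(len(ratings[0])), 2))
    means = column_means(ratings)
    observed = [abs(means[a] - means[b]) for a, b in pairs]
    # Rank hypotheses from the largest observed difference to the smallest.
    order = sorted(range(len(pairs)), key=lambda i: observed[i], reverse=True)
    counts = [0] * len(pairs)
    for _ in range(trials):
        # Blocked resample: each listener draws only from his own ratings.
        resampled = [[rng.choice(row) for _ in row] for row in ratings]
        m = column_means(resampled)
        running_max = 0.0
        # Walk up from the smallest observed difference, keeping a running
        # maximum of the resampled statistics seen so far.
        for rank in range(len(order) - 1, -1, -1):
            a, b = pairs[order[rank]]
            running_max = max(running_max, abs(m[a] - m[b]))
            if running_max >= observed[order[rank]] - 1e-12:
                counts[rank] += 1
    adjusted = {}
    previous = 0.0
    for rank, i in enumerate(order):
        # Enforce monotonicity: an adjusted p-value never drops below the
        # adjusted p-value of a more extreme comparison.
        previous = max(previous, counts[rank] / trials)
        adjusted[pairs[i]] = previous
    return adjusted

# Made-up ratings (hypothetical): codecs 0 and 1 are close, codec 2 is worse.
sample = [[4.5, 5.0, 2.0], [5.0, 4.5, 1.5], [5.0, 5.0, 2.5]] * 4
for pair, p in sorted(free_stepdown_adjust(sample, trials=500).items()):
    print(pair, p)
```

The most extreme comparison is judged against the maximum over all pairs, and each later comparison against the maximum over only the pairs that remain, which is why a step-down adjustment is less conservative than applying a single worst-case correction to every comparison.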

Comments on the results: I didn't want to use extremely difficult samples, which wouldn't be representative of most normal music. However, I erred on the side of choosing samples which were too easy for the encoders at 128 kbit/s. Fossiles.wav proved to lack any kind of discriminating power. Rawhide.wav had significant results at one point during the test, but these subsequently faded into the noise as further listeners submitted results. Wayitis.wav had adequate discriminating power; however, the 28th (and last) listener submitted results which deviated significantly from the rest of the group, causing three comparisons to lose significance. I verified to my satisfaction that this listener was not a troll, that he performed the test properly, and that he was truly able to hear differences from the original (submitting highly significant ABX results). I cannot discard this data. But the wildly discrepant preferences puzzle me.

The next test I organize will hopefully use a tool better suited to post-screening, such that results from listeners who consistently rate the original worse than encoded files will be discarded. For the current test, I am bound to keep all results as being equal in worth, even though I suspect that not to be the case.
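One possible shape for such a post-screening rule is sketched below. Everything in it is an assumption made for illustration: the data layout (one pair of reference and encoded ratings per trial), the function name, and the cutoff of "more than half of the trials" as the meaning of "consistently". No such rule was applied to the results on this page.

```python
def keep_listener(trials, max_reversed_fraction=0.5):
    """Decide whether to keep one listener's results.

    trials is a list of (reference_rating, encoded_rating) pairs
    (hypothetical layout). A trial is "reversed" when the original
    was rated worse than the encoded file.
    """
    if not trials:
        return True  # nothing to judge the listener on
    reversed_count = sum(1 for ref, enc in trials if ref < enc)
    return reversed_count / len(trials) <= max_reversed_fraction

# Made-up examples: the first listener never reverses, the second does so twice out of three.
print(keep_listener([(5.0, 4.0), (5.0, 3.0), (5.0, 5.0)]))
print(keep_listener([(3.0, 5.0), (4.0, 5.0), (5.0, 4.0)]))
```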

 


Resampling analysis, fossiles.wav:
25 listeners
Each listener's resampled values are chosen from his own pool of ratings
  (blocked analysis).
Blocked model used to calculate unadjusted p-values.
100,000 bootstrap trials are performed to adjust p-values using free step-down.

                            Means: 
wma8     lame     aac      mpc      ogg      xing
  4.58     4.55     4.51     4.50     4.36     4.30

                            Unadjusted p-values
         lame     aac      mpc      ogg      xing
wma8     0.846    0.646    0.628    0.177    0.083
lame       -      0.790    0.771    0.247    0.123
aac        -        -      0.981    0.371    0.201
mpc        -        -        -      0.384    0.210
ogg        -        -        -        -      0.699

                             Adjusted p-values
         lame     aac      mpc      ogg      xing
wma8     0.991    0.987    0.987    0.738    0.507
lame       -      0.991    0.991    0.805    0.626
aac        -        -      0.991    0.917    0.770
mpc        -        -        -      0.915    0.771
ogg        -        -        -        -      0.991

No comparisons for this sample are significant at the 95% confidence level, before or after adjustment.


Resampling analysis, rawhide.wav:
26 listeners
Each listener's resampled values are chosen from his own pool of ratings
  (blocked analysis).
Blocked model used to calculate unadjusted p-values.
100,000 bootstrap trials are performed to adjust p-values using free step-down.

                            Means:
aac      ogg      wma8     lame     mpc      xing
  4.54     4.54     4.47     4.31     4.28     4.22

                            Unadjusted p-values
         ogg      wma8     lame     mpc      xing
aac      0.979    0.611    0.104    0.066    0.024*
ogg        -      0.630    0.110    0.070    0.026*
wma8       -        -      0.261    0.182    0.079
lame       -        -        -      0.830    0.520
mpc        -        -        -        -      0.668

                             Adjusted p-values
         ogg      wma8     lame     mpc      xing
aac      0.984    0.978    0.505    0.413    0.206
ogg        -      0.978    0.505    0.417    0.209
wma8       -        -      0.781    0.665    0.435
lame       -        -        -      0.978    0.959
mpc        -        -        -        -      0.978

Comparisons marked with * below are true as a group with 95% confidence (adjusted p < 0.05). Unmarked comparisons are significant only before adjustment.

aac  is better than xing
ogg  is better than xing

Resampling analysis, wayitis.wav:
28 listeners
Each listener's resampled values are chosen from his own pool of ratings
  (blocked analysis).
Blocked model used to calculate unadjusted p-values.
100,000 bootstrap trials are performed to adjust p-values using free step-down.

                            Means:
ogg      mpc      aac      lame     xing     wma8
  4.34     4.33     4.15     4.02     3.59     3.52

                            Unadjusted p-values
         mpc      aac      lame     xing     wma8
ogg      0.973    0.379    0.145    0.001*   0.000*
mpc        -      0.397    0.154    0.001*   0.000*
aac        -        -      0.561    0.011*   0.004*
lame       -        -        -      0.048*   0.021*
xing       -        -        -        -      0.740

                             Adjusted p-values
         mpc      aac      lame     xing     wma8
ogg      0.974    0.848    0.539    0.007*   0.002*
mpc        -      0.840    0.534    0.008*   0.003*
aac        -        -      0.915    0.081    0.036*
lame       -        -        -      0.251    0.137
xing       -        -        -        -      0.932

Comparisons marked with * below are true as a group with 95% confidence (adjusted p < 0.05). Unmarked comparisons are significant only before adjustment.

ogg  is better than wma8 *
mpc  is better than wma8 *
ogg  is better than xing *
mpc  is better than xing *
aac  is better than wma8 *
aac  is better than xing
lame is better than wma8
lame is better than xing

January 22, 2002

Rich Ulrich, biostatistician at the University of Pittsburgh, provided the following comments concerning the wayitis.wav sample.

Okay, I just looked at wayitis, where there are 6 ratings per person, for 28 persons, of which 24 are useful; 4 show no variation (rated 5 for all).

I copied the data; I transposed it to give me 6 lines, one per codec, with 24 Raters. (SPSS Flip procedure.)

I used the Reliability procedure of SPSS, which includes a 'corrected item-total correlation' -- the correlation of each Rater (in this case) with the average for all the *other* raters. I was surprised to see that 6 of the 24 were negative. Here are the values, decimals omitted: 96, 95, 92, 86, 86, 86, 84, 82, 80, 64, 58, 56, 56, 46, 38, 37, 35, 35, 31, and negative: 23, 47, 50, 56, 71, 81.

The raters with -.81 and -.71 were subjects whose scores started (4.5,3.5,1.0 -- last in the list) and (5.0,5.0,4.0 -- #13 out of 24).

With an N=6, I don't know if the correlations below 70 represent random variation of poor reliability, or what. But the higher ones ought to represent consistent information, even if reported in the wrong direction.

So, I see 2 cases where something went badly amiss, and 4 with negative r, and 5 or 10 with small r. I looked closer at the negatives. Their average scores are patterned just the reverse of the positive scores; there seems to be no new information.

If these were my data: I would want to figure out what could have gone wrong, especially with the 2: faulty randomization? spoofing?

Any complete report will describe the inconsistencies. Some raters show the reverse? - because of odd equipment? odd subjective standards?

Given the lack of controls in conditions and equipment, I think a sub-analysis is justified, based on the 9 'consistent' raters alone.

I verified Rich's numbers using my poor-man's version of SPSS (Microsoft Excel; I didn't check to see if it could be done with R). The correlation coefficients for each rater are as follows. The nine consistent listeners (the nine highest correlations, r >= 0.81) are marked with *. Note: all 28 responses are valid data. This subanalysis attempts to isolate the most common preference. Having a minority and divergent preference does not make the data invalid!

Corrected Item-Total Correlations
Listener      r         Listener      r
   1        0.86 *         17       -0.72
   2        0.95 *         18        0.82 *
   3        0.31           19        0.96 *
   6        0.81 *         20       -0.49
   7       -0.56           21        0.46
   8        0.64           22       -0.48
   9        0.58           23        0.86 *
  10        0.86 *         24       -0.23
  11        0.56           25        0.37
  13        0.38           26        0.35
  14        0.84 *         27        0.92 *
  16        0.35           28       -0.81
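For readers without SPSS, the corrected item-total correlation described in Rich's comments can be sketched as follows: each rater's ratings are correlated (Pearson) with the average of all the other raters' ratings, after dropping raters who gave every codec the same score. This is a minimal sketch with made-up ratings and hypothetical names, not the spreadsheet used for the table above. It assumes at least two raters show variation and that the others' average is not constant.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corrected_item_total(ratings):
    """ratings maps rater name -> list of ratings, one per codec.

    Returns rater name -> correlation with the mean of the *other* raters.
    """
    # Raters with no variation carry no ranking information; drop them.
    useful = {name: row for name, row in ratings.items() if len(set(row)) > 1}
    result = {}
    for name, row in useful.items():
        others = [r for other, r in useful.items() if other != name]
        others_mean = [sum(col) / len(others) for col in zip(*others)]
        result[name] = pearson(row, others_mean)
    return result

# Made-up ratings (hypothetical): R prefers the codecs the others dislike,
# and D rated everything 5.
made_up = {
    "A": [5, 4, 3, 2],
    "B": [5, 4, 2, 1],
    "C": [4, 5, 3, 1],
    "R": [1, 2, 4, 5],
    "D": [5, 5, 5, 5],
}
for name, r in corrected_item_total(made_up).items():
    print(name, round(r, 2))
```

Leaving each rater out of his own comparison average is what makes the correlation "corrected": otherwise every rater would correlate partly with himself.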

When these nine listener responses are analyzed, the results are:

Resampling analysis (nine "consistent" listeners, wayitis.wav):
9 listeners
Each listener's resampled values are chosen from his own pool of ratings
  (blocked analysis).
Blocked model used to calculate unadjusted p-values.
100,000 bootstrap trials are performed to adjust p-values using free step-down.

                            Means:
mpc      ogg      lame     aac      wma8     xing
  4.63     4.09     3.61     3.36     2.11     2.04

                            Unadjusted p-values
         ogg      lame     aac      wma8     xing
mpc      0.022*   0.000*   0.000*   0.000*   0.000*
ogg        -      0.043*   0.003*   0.000*   0.000*
lame       -        -      0.270    0.000*   0.000*
aac        -        -        -      0.000*   0.000*
wma8       -        -        -        -      0.772

                             Adjusted p-values
         ogg      lame     aac      wma8     xing
mpc      0.078    0.000*   0.000*   0.000*   0.000*
ogg        -      0.116    0.011*   0.000*   0.000*
lame       -        -      0.467    0.000*   0.000*
aac        -        -        -      0.000*   0.000*
wma8       -        -        -        -      0.772

Comparisons marked with * below are true as a group with 95% confidence (adjusted p < 0.05). Unmarked comparisons are significant only before adjustment.

mpc is better than xing *
ogg is better than xing *
lame is better than xing *
aac is better than xing *
mpc is better than wma8 *
ogg is better than wma8 *
lame is better than wma8 *
aac is better than wma8 *
mpc is better than aac *
ogg is better than aac *
mpc is better than lame *
ogg is better than lame
mpc is better than ogg

NOTE: If a tenth listener is added (the one with a high negative correlation of -0.81) to this subanalysis, the results essentially look like the overall results for the total 28 listeners. That is, one sensitive listener with highly divergent preferences has a large effect.
