Notes by ff123
The 64 kbit/s Listening Test plan and method are shown here.
The 64 kbit/s Listening Test results are here.
1. Who is this "ff123" person?
2. The test methodology involves having people send in subjective ratings based upon listening to some sound samples
3. Who cares about 64 kbps tests?
4. No control
1. Who is this "ff123" person, and why exactly should anyone care about a bunch of numbers that it could have very well pulled from its ass?
Response: The study stands or falls on its own merits, regardless of who I am. The raw data is all there for everybody to inspect. See the 64 kbit/s Listener Comments Page. See the Practice With ABC/HR Page for an explanation of the double-blind application which was used to collect listener ratings and comments.
2. Yes, I noticed that the test methodology involves having people send in subjective ratings based upon listening to some sound samples. Basing an opinion on something as enormously subjective as how people interpret sound, especially since they're doing so with non-standardized audio systems.
These surveys have absolutely zero scientific worth and are non-reproducible. If you want to read an article that uses good methodology and testing procedures, go to http://www.r3mix.net and click on the analysis button.
Response: The only valid way to evaluate perceptual (lossy) codecs is to perform a subjective test. Subjective (or sensory) testing (also called psychophysics) is a scientific discipline, with dedicated journals such as Chemical Senses, Journal of Sensory Studies, and Journal of Texture Studies. My primary reference has been a book called Sensory Evaluation Techniques, 3rd Edition, by Meilgaard, Civille, and Carr.
Objective techniques normally used for evaluating audio quality are meaningless when faced with perceptual codecs. These codecs are designed to take advantage of certain distortions which, while easily measured, are difficult or impossible to hear. That's why the developers and entities responsible for evaluating lossy codecs rely upon listening tests. For an example of a properly performed listening test conducted by the MPEG group, see the MPEG group's "Results of Performance Tests" page and download their "Results of AAC subjective tests." The r3mix.net test method of comparing the frequency responses of various codecs using test signals, while perfectly repeatable, is also perfectly invalid for evaluating perceptual codecs.
Regarding the reproducibility of subjective tests: a result which is significant at the 95% confidence level is one that would arise by chance less than 5 times out of 100 if there were no real difference between the codecs. Such a result is therefore likely to be repeatable given the same test conditions and listeners.
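As a rough illustration of what a significance threshold means for paired listener ratings, here is a minimal sketch of an exact sign-flip permutation test. It is not the analysis used for this test; the function name and the ratings are hypothetical and only show the idea.

```python
from itertools import product

def paired_permutation_p_value(ratings_a, ratings_b):
    """Exact two-sided sign-flip permutation test on paired ratings.

    Hypothetical helper: each position holds one listener's ratings
    of codec A and codec B for the same sample.
    """
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # If the codecs were truly equivalent, each listener's difference
    # would be equally likely to have either sign, so try every sign pattern.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float ratings
            extreme += 1
    return extreme / total

# Made-up ratings from six listeners (not data from the test).
codec_a = [4, 4, 4, 4, 4, 4]
codec_b = [3, 3, 3, 3, 3, 3]
p = paired_permutation_p_value(codec_a, codec_b)
print(p < 0.05)
```

With these made-up numbers only 2 of the 64 sign patterns are as extreme as the observed one, so p = 2/64, which is below the 0.05 threshold that corresponds to the 95% level.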
Regarding non-standardized audio systems: The concern is that the results could be biased. The listening setups were not standardized, nor was any attempt made to survey what people were using to listen to the samples. While it might have been interesting to find out whether people were using headphones or speakers, or whether substandard equipment (e.g., typical multimedia computer speakers) was being used, I think that from a practical standpoint it wouldn't have changed the test procedure or the analysis. How does one account for other things such as background noise in a test? This is as important a factor in being able to hear artifacting as the actual listening equipment. So is listener experience in hearing artifacting.
The most likely effect, in my opinion, of not having controls on the listening setup or on the listening environment, or on the actual listener, is not to add a bias, but rather to add statistical noise or to lessen the sensitivity of the comparison. In other words, I think that untrained listeners or people using cheap multimedia speakers would probably not be able to hear some of the artifacting that others hear, but their relative ratings of codecs would likely not differ radically from the experienced group of people using high quality headphones. Of course that's an unproven conjecture, but I think a reasonable one.
There's another reason why I think that the lack of controls on listener setup, environment, and training is not a fatal flaw: the actual results were statistically significant. I think the most reasonable interpretation of this is that the test was able to discriminate between codec quality in spite of, not because of, the lack of controls.
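The noise-versus-bias argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" codec scores, the Gaussian noise model, and the function name are all hypothetical, ratings are not clipped to a scale, and nothing here comes from the actual test data.

```python
import random
import statistics

def simulate(true_a, true_b, noise_sd, n_listeners, seed=0):
    """Simulate listeners rating two codecs with unbiased random noise.

    Returns (mean rating difference, difference divided by its spread).
    All inputs are hypothetical; this does not model the real listeners.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_listeners):
        a = true_a + rng.gauss(0.0, noise_sd)  # noisy rating of codec A
        b = true_b + rng.gauss(0.0, noise_sd)  # noisy rating of codec B
        diffs.append(a - b)
    mean_diff = statistics.mean(diffs)
    sensitivity = mean_diff / statistics.stdev(diffs)
    return mean_diff, sensitivity

# Assumed values: codec A is truly one point better than codec B.
quiet_mean, quiet_sens = simulate(4.0, 3.0, noise_sd=0.5, n_listeners=2000)
noisy_mean, noisy_sens = simulate(4.0, 3.0, noise_sd=2.0, n_listeners=2000)
print(quiet_mean > 0 and noisy_mean > 0)  # same preferred codec in both cases
print(quiet_sens > noisy_sens)            # but the noisy group discriminates less
```

Under these assumptions the noisier group still prefers the same codec on average, but the difference is smaller relative to the scatter, so more listeners are needed to detect it. The sketch does not show that real uncontrolled setups behave this way; it only shows what "noise rather than bias" would look like.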
3. Who cares about 64 kbps tests?
Response: People who stream. Also, this is an interesting bitrate to me. Microsoft claims that 64 kbit/s wma8 files are "CD quality." I wanted to test this claim, and also compare the various other formats competing at this bitrate, such as mp3pro and ogg vorbis. However, I don't agree with people who have claimed that results at this bitrate have relevance at other bitrates. Additional tests must be performed to say anything meaningful about codec performance at other bitrates.
4. While the tests are interesting, they're not scientific. No control, the users could tell what codec was being used, and the participants were self-selected. Durandal's point about the machine setup is valid also, ideally they'd all be the same.
Response: There was a control. Before each codec could be rated, the listener was forced to choose between the original and the coded sample. If the listener rated the original less than perfect on any codec, his results were discarded. The codecs being rated were known; however the listener did not know which codec was being evaluated at any particular time. See the Practice With ABC/HR Page for an explanation of the double-blind application which was used to collect listener ratings and comments.
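The screening rule described above can be sketched as follows. The data layout, the names, and the 5.0 top rating are assumptions made for illustration, as is the reading that all of a listener's results are dropped; this is not the code or file format actually used to process the results.

```python
from typing import Dict, Tuple

# Hypothetical layout: listener -> codec -> (rating given to the hidden
# original, rating given to the coded sample).
Ratings = Dict[str, Dict[str, Tuple[float, float]]]

PERFECT = 5.0  # assumed top of the rating scale

def keep_valid_listeners(ratings: Ratings) -> Ratings:
    """Drop every listener who rated the hidden original below PERFECT
    for any codec, since that listener mistook the original for the
    coded sample at least once."""
    return {
        listener: by_codec
        for listener, by_codec in ratings.items()
        if all(original == PERFECT for original, _ in by_codec.values())
    }

# Made-up example entries, not data from the test.
example: Ratings = {
    "listener1": {"codecX": (5.0, 3.2), "codecY": (5.0, 2.1)},
    "listener2": {"codecX": (4.0, 5.0), "codecY": (5.0, 2.5)},
}
valid = keep_valid_listeners(example)
print(sorted(valid))
```

Here the hypothetical listener2 is removed because the hidden original was rated 4.0 for one codec.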
The participants were self-selected, that is true. So this test does not claim to say anything about the preferences of the general population. Instead it says something about the preferences of interested listeners, a group which is likely to be more sensitive overall to artifacting than the general population. The point about machine setup is addressed in the response to question 2.