Also read what they say about the "pitfalls" of ABX testing...
Number of testers: Studies made with a small number of listeners are more sensitive to mistakes occurring in the test setup: the wrong stimulus presented, mistakes in copying the results, and so on. For this reason, when the result depends on one or two people, conclusions must be cautious.
Predictability level: a success is more likely to turn up somewhere among N tests than in a single test. For example, if we test something that has no effect, the result we get will be decided by chance alone. Imagine that 20 people run independent tests. By chance, on average, one of them should get a "false positive" result, since a positive result is by definition something that occurs no more than one time out of 20. The p calculation of each test does not take this into account.
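To make the "20 people" example concrete, here is a small simulation sketch. It assumes that every tester's result is pure chance and that a single test comes out "positive" with probability exactly 0.05 (real ABX p values are discrete, so the true rate is usually a bit lower). The function name, the number of rounds, and the seed are made up for the illustration.

```python
import random


def simulate_rounds(n_testers=20, alpha=0.05, n_rounds=20000, seed=0):
    """Simulate many rounds in which n_testers test something with no effect.

    Assumption: each tester independently gets a "positive" result with
    probability alpha, i.e. purely by chance.
    Returns (average false positives per round,
             share of rounds with at least one false positive).
    """
    rng = random.Random(seed)
    total_false_positives = 0
    rounds_with_false_positive = 0
    for _ in range(n_rounds):
        # Count how many of the testers hit the 1-in-20 event this round.
        hits = sum(1 for _ in range(n_testers) if rng.random() < alpha)
        total_false_positives += hits
        if hits > 0:
            rounds_with_false_positive += 1
    return (total_false_positives / n_rounds,
            rounds_with_false_positive / n_rounds)


mean_fp, share_with_fp = simulate_rounds()
print("average false positives per round:", round(mean_fp, 2))
print("share of rounds with at least one:", round(share_with_fp, 2))
```

Under these assumptions the average should land close to one false positive per round, and well over half of the rounds should contain at least one, even though nothing real is being detected.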
Multiple comparisons: if we select two groups in the population using one criterion, there will be less than 1 chance out of 20 of getting a "statistical difference" between the two. However, if we consider 20 independent criteria, the probability of getting a significant difference on at least one of them is much higher than 1/20.
For example, if people are asked to rate the "dynamics", "soundstage", and "coloration" of an encoder, the probability of getting a false positive is about three times as high as with one criterion only, since there are three opportunities for the event to occur. Once again, the p value associated with each comparison is lower than the real probability of getting a false positive.
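The "about three times as high" figure can be checked with a short calculation. This is only a sketch: it assumes the criteria are independent and that each one is tested at the 0.05 level; the function name is invented for the example.

```python
def familywise_false_positive_rate(n_criteria, alpha=0.05):
    """Chance of at least one false positive among n_criteria comparisons.

    Assumption: the comparisons are independent and each one has a
    false-positive probability of alpha.
    """
    return 1.0 - (1.0 - alpha) ** n_criteria


for k in (1, 3, 20):
    rate = familywise_false_positive_rate(k)
    # The ratio shows how much higher the risk is than for a single criterion.
    print(k, "criteria:", round(rate, 3), "ratio to 0.05:", round(rate / 0.05, 2))
```

With three criteria the rate comes out at roughly 0.14, a little under three times 0.05, and with 20 criteria it is roughly 0.64. If the ratings are correlated (which "dynamics" and "soundstage" may well be), the real figure will differ.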
Another interesting point that also has to be taken into account when ABX testing hi-fi equipment... is the one the Superman and MP3 example illustrates:
If we are testing the existence of Superman and get a positive answer, that is, "Superman really exists because the probability of the null hypothesis is less than 5%", must we accept the existence of Superman? Is it an infallible, scientific proof of his existence?
No, it's just chance. Getting an event whose probability is less than 5% is not uncommon.
However, when a listening test about MP3 at 96 kbps gives a similarly significant result, we accept the opposite conclusion! That it was not chance. Why?
Why should the same scientific result be interpreted in two opposite ways? This is because we always keep the most probable hypothesis. The conclusion of an ABX test is not the p value alone; it is its comparison with the subjective p value of the tested hypothesis.
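One way to read "comparison with the subjective p value" is as a Bayes' rule calculation. The sketch below is an interpretation, not something the quoted text spells out: the prior probabilities, the 80% chance of detecting a real effect, and the function name are all hypothetical numbers picked for illustration.

```python
def posterior_given_significant(prior, power=0.8, alpha=0.05):
    """Probability that the effect is real, given a significant test result.

    prior: subjective probability that the effect exists (assumed value).
    power: chance that the test comes out significant if the effect is real
           (assumed value).
    alpha: chance that the test comes out significant by luck alone.
    """
    true_positive = prior * power
    false_positive = (1.0 - prior) * alpha
    return true_positive / (true_positive + false_positive)


# Hypothetical priors: Superman is wildly unlikely, while audible
# artifacts in a 96 kbps MP3 are considered very plausible beforehand.
superman = posterior_given_significant(prior=1e-6)
mp3_96kbps = posterior_given_significant(prior=0.9)
print("Superman exists:", superman)
print("MP3 at 96 kbps is audible:", mp3_96kbps)
```

With these made-up priors, the same significant result leaves Superman's existence extremely improbable while making the MP3 difference almost certain, which is the asymmetry the quote describes.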
Regards,
tas