Multiple comparisons
The t-test (paired and unpaired), the Wilcoxon test (paired and unpaired) and the McNemar test only work for comparing 2 groups/sets of data.
If you have more than two groups, you have a multiple comparisons problem.
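A minimal R sketch of these two-group tests, with made-up data (the vectors x, y and the 2x2 table are placeholders, not anything from the notes):

```r
set.seed(1)
x <- rnorm(20, mean = 0)    # measurements for group/condition A
y <- rnorm(20, mean = 0.5)  # measurements for group/condition B

t.test(x, y)                     # unpaired t-test
t.test(x, y, paired = TRUE)      # paired t-test
wilcox.test(x, y)                # unpaired Wilcoxon (Mann-Whitney)
wilcox.test(x, y, paired = TRUE) # paired Wilcoxon (signed-rank)

# McNemar: paired 0/1 outcomes, e.g. two classifiers on the same test cases
tab <- matrix(c(30, 5, 9, 6), nrow = 2,
              dimnames = list(A = c("correct", "wrong"),
                              B = c("correct", "wrong")))
mcnemar.test(tab)
```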
1 Omnibus tests
Null hypothesis - all sets come from the same population.
If the null is rejected (p-value < 0.05), then it is not true that all groups came from the same distribution, but the test does not tell us which one(s) are different and which are not significantly different.
- omnibus for parametric non-paired data - ANOVA
- for parametric paired data - repeated measures ANOVA
- for non-parametric non-paired data - Kruskal-Wallis
- for non-parametric paired data - Friedman test
In the statistical literature, paired designs are also called within-subjects designs and non-paired designs are called between-subjects designs.
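A minimal R sketch of the four omnibus tests, assuming a long-format data frame d with columns value, group and subject (all toy data); the repeated measures ANOVA is written with an Error term, which is one common way to do it in base R:

```r
set.seed(2)
d <- data.frame(
  value   = rnorm(60),
  group   = factor(rep(c("g1", "g2", "g3"), each = 20)),
  subject = factor(rep(1:20, times = 3))  # same 20 subjects in every group (paired)
)

# Parametric, non-paired: one-way ANOVA
summary(aov(value ~ group, data = d))

# Parametric, paired: repeated measures ANOVA (subject as error stratum)
summary(aov(value ~ group + Error(subject/group), data = d))

# Non-parametric, non-paired: Kruskal-Wallis
kruskal.test(value ~ group, data = d)

# Non-parametric, paired: Friedman (formula interface: value ~ groups | blocks)
friedman.test(value ~ group | subject, data = d)
```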
2 Post hoc tests
Pairwise comparisons (using the appropriate 2-group tests) plus p-value corrections (because of the multiple comparisons).
There are many p-value correction methods:
- Bonferroni
- Holm and others
- FDR - controlling only the false discovery rate (false positives)
Use Holm.
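A short R sketch of the corrections, reusing the toy data frame d from the omnibus sketch; p.adjust works on any vector of raw p-values, and pairwise.t.test / pairwise.wilcox.test do the pairwise tests and the correction in one call:

```r
raw_p <- c(0.001, 0.012, 0.030, 0.045, 0.210, 0.800)  # made-up pairwise p-values

p.adjust(raw_p, method = "bonferroni")
p.adjust(raw_p, method = "holm")  # the recommended default above
p.adjust(raw_p, method = "fdr")   # Benjamini-Hochberg FDR control

# Or let R do the pairwise tests and the correction together
pairwise.t.test(d$value, d$group, p.adjust.method = "holm")
pairwise.wilcox.test(d$value, d$group, p.adjust.method = "holm", paired = TRUE)
```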
There are multiple-comparison procedures that result in a critical difference - a difference between means or mean ranks below which the difference is not significant:
- Tukey test / Honest Significant Difference (HSD) / Tukey-Kramer test - for non-paired parametric comparisons (see the sketch after this list)
- Nemenyi test - for paired non-parametric comparisons
- others (for paired parametric or non-paired non-parametric data) that I do not know
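For the non-paired parametric case, base R has TukeyHSD on an aov fit; a minimal sketch with the same toy data frame d (there is no base-R Nemenyi test that I can vouch for here; packages such as scmamp cover it):

```r
fit <- aov(value ~ group, data = d)  # non-paired one-way ANOVA
TukeyHSD(fit)                        # Tukey Honest Significant Difference
# The output lists, for every pair of groups, the difference in means,
# a confidence interval and an adjusted p-value.
```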
3 Comparing your algorithm with many others (paired)
Especially relevant for machine learning.
Friedman + pairwise comparisons with Holm correction.
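A minimal R sketch of this procedure, assuming a matrix acc of accuracies with one row per dataset and one column per algorithm, the first column being your algorithm (names and numbers are made up):

```r
set.seed(3)
acc <- matrix(runif(10 * 4, 0.7, 0.9), nrow = 10,
              dimnames = list(paste0("dataset", 1:10),
                              c("mine", "rivalA", "rivalB", "rivalC")))

# Omnibus test: do the algorithms differ at all? (rows = blocks, columns = groups)
friedman.test(acc)

# Pairwise: your algorithm against each rival (paired Wilcoxon), Holm-corrected
raw_p <- sapply(2:ncol(acc),
                function(j) wilcox.test(acc[, 1], acc[, j], paired = TRUE)$p.value)
names(raw_p) <- colnames(acc)[-1]
p.adjust(raw_p, method = "holm")
```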
4 Comparing a set of different algorithms (paired)
Friedman + Nemenyi = Demsar procedure
Further extensions of the Demsar procedure are implemented in the scmamp package in R.
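A sketch of the Demsar-style all-vs-all analysis with scmamp; plotCD is the critical-difference plot function as I remember it from that package, so treat the name and its arguments as an assumption and check the package documentation:

```r
# install.packages("scmamp")  # if not already installed
library(scmamp)

# acc as above: rows = datasets, columns = algorithms
friedman.test(acc)   # omnibus (base R; scmamp has its own variants too)

# Critical-difference diagram (Friedman + Nemenyi, Demsar-style).
# plotCD is an assumption from memory - verify against the scmamp docs.
plotCD(acc, alpha = 0.05)
```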
5 How to show that your classifier is better than the competition (especially for the ML students)
Comparing 2 groups - not a multiple comparison problem
Cross-validation - separate training and test subsets of the data
- holdout
- k-fold
- bootstrap
- and one can repeat each of them
The usual CV setups:
- 80/20 or 70/30 holdout
- 5 or 10-fold
- 100 repetitions of bootstrap
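A minimal base-R sketch of a k-fold split (here k = 5), with a placeholder "classifier" that just predicts the majority class so the example stays self-contained; in practice you would plug in your own model:

```r
set.seed(4)
d2 <- data.frame(x = rnorm(100),
                 y = factor(sample(c("a", "b"), 100, replace = TRUE)))

k <- 5
fold <- sample(rep(1:k, length.out = nrow(d2)))  # random fold assignment

acc_per_fold <- sapply(1:k, function(i) {
  train <- d2[fold != i, ]
  test  <- d2[fold == i, ]
  # Placeholder "classifier": always predict the majority class of the training set
  majority <- names(which.max(table(train$y)))
  mean(test$y == majority)  # accuracy on the held-out fold
})
acc_per_fold
```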
Let us assume a 5-fold.
- 5 test subsets (each fold)
- you could use a McNemar test (paired 0/1 data), but I have never seen anyone do that (maybe because McNemar tests are less powerful - last exercise!)
- you could measure the accuracy on the 5 test sets and compare the 5 numbers using Wilcoxon - but the number of data points is too low
- you could repeat the 5-fold 10 times and have 50 numbers to compare - this is possibly considered cheating
- there is a problem that the measures for each fold are not really independent - any two training sets share 3 of the 5 folds. You cannot trust statistical tests that assume the data are independent
- this paper proposes the correlated t-test to deal with the non-independence/correlation of the data, but I do not know of any paper that has used it (see the sketch after this list)
- this paper proposes 5 repetitions of a 2-fold CV (called 5x2cv) - you get 10 measures of accuracy - the author claims this balances the correlation problem against having enough data points
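Assuming the correlated t-test mentioned above is the corrected resampled t-test usually attributed to Nadeau and Bengio, it replaces the 1/n variance factor of the standard paired t-test with 1/n + n_test/n_train to compensate for the overlapping training sets; a sketch under that assumption, with made-up per-fold accuracy differences:

```r
# Corrected resampled t-test (assuming this is the "correlated t-test" meant above)
corrected_t_test <- function(diffs, n_train, n_test) {
  n  <- length(diffs)  # number of folds / repetitions
  m  <- mean(diffs)
  s2 <- var(diffs)
  # A standard paired t-test would use s2 / n; the correction inflates the
  # variance by n_test / n_train to account for overlapping training sets.
  t_stat <- m / sqrt((1 / n + n_test / n_train) * s2)
  p <- 2 * pt(-abs(t_stat), df = n - 1)
  c(t = t_stat, p.value = p)
}

# Example: accuracy differences (algorithm A - algorithm B) over 10 x 5-fold CV,
# with 80 training and 20 test examples per fold (all numbers made up)
set.seed(5)
diffs <- rnorm(50, mean = 0.01, sd = 0.02)
corrected_t_test(diffs, n_train = 80, n_test = 20)
```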