Multiple comparisons
The t-test (paired and unpaired), the Wilcoxon test (paired and unpaired) and the McNemar test only work for comparing 2 groups/sets of data.
If you have more than two groups, you have a multiple comparisons problem.
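A minimal R sketch of these two-group tests, with made-up data (the vectors x, y and the 2x2 table are placeholders, not anything from the notes):

```r
set.seed(1)
x <- rnorm(20, mean = 0)    # measurements for group/condition A
y <- rnorm(20, mean = 0.5)  # measurements for group/condition B

t.test(x, y)                     # unpaired t-test
t.test(x, y, paired = TRUE)      # paired t-test
wilcox.test(x, y)                # unpaired Wilcoxon (Mann-Whitney)
wilcox.test(x, y, paired = TRUE) # paired Wilcoxon (signed-rank)

# McNemar: paired 0/1 outcomes, e.g. two classifiers on the same test cases
tab <- matrix(c(30, 5, 9, 6), nrow = 2,
              dimnames = list(A = c("correct", "wrong"),
                              B = c("correct", "wrong")))
mcnemar.test(tab)
```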
1 Omnibus tests
Null hypothesis - all sets come from the same population.
If the null is rejected (p-value < 0.05), then it is not true that all groups came from the same distribution, but the test does not tell us which one(s) are different and which are not significantly different.
- omnibus for parametric non-paired data - ANOVA
- for parametric paired data - repeated measures ANOVA
- for non-parametric non-paired data - Kruskal-Wallis
- for non-parametric paired data - Friedman test
In the statistical literature, paired designs are also called within-subjects designs and non-paired designs are called between-subjects designs.
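A minimal R sketch of the four omnibus tests, assuming a long-format data frame d with columns value, group and subject (all toy data); the repeated measures ANOVA is written with an Error term, which is one common way to do it in base R:

```r
set.seed(2)
d <- data.frame(
  value   = rnorm(60),
  group   = factor(rep(c("g1", "g2", "g3"), each = 20)),
  subject = factor(rep(1:20, times = 3))  # same 20 subjects in every group (paired)
)

# Parametric, non-paired: one-way ANOVA
summary(aov(value ~ group, data = d))

# Parametric, paired: repeated measures ANOVA (subject as error stratum)
summary(aov(value ~ group + Error(subject/group), data = d))

# Non-parametric, non-paired: Kruskal-Wallis
kruskal.test(value ~ group, data = d)

# Non-parametric, paired: Friedman (formula interface: value ~ groups | blocks)
friedman.test(value ~ group | subject, data = d)
```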
2 Post hoc tests
Pairwise comparisons (using the appropriate 2-group tests) plus p-value corrections (because of the multiple comparisons).
There are many p-value correction methods:
- Bonferroni
- Holm and others
- FDR - controlling only the false discovery rate (false positives)
Use Holm.
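A short R sketch of the corrections, reusing the toy data frame d from the omnibus sketch; p.adjust works on any vector of raw p-values, and pairwise.t.test / pairwise.wilcox.test do the pairwise tests and the correction in one call:

```r
raw_p <- c(0.001, 0.012, 0.030, 0.045, 0.210, 0.800)  # made-up pairwise p-values

p.adjust(raw_p, method = "bonferroni")
p.adjust(raw_p, method = "holm")  # the recommended default above
p.adjust(raw_p, method = "fdr")   # Benjamini-Hochberg FDR control

# Or let R do the pairwise tests and the correction together
pairwise.t.test(d$value, d$group, p.adjust.method = "holm")
pairwise.wilcox.test(d$value, d$group, p.adjust.method = "holm", paired = TRUE)
```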
There are multiple-comparison procedures that result in a critical difference - a difference between means or mean ranks below which the difference is not significant:
- Tukey test / Honest Significant Difference (HSD) / Tukey-Kramer test - for non-paired parametric comparisons (see the sketch after this list)
- Nemenyi test - for paired non-parametric comparisons
- others (for paired parametric or non-paired non-parametric data) that I do not know
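For the non-paired parametric case, base R has TukeyHSD on an aov fit; a minimal sketch with the same toy data frame d (there is no base-R Nemenyi test that I can vouch for here; packages such as scmamp cover it):

```r
fit <- aov(value ~ group, data = d)  # non-paired one-way ANOVA
TukeyHSD(fit)                        # Tukey Honest Significant Difference
# The output lists, for every pair of groups, the difference in means,
# a confidence interval and an adjusted p-value.
```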
3 Comparing your algorithm with many others (paired)
Especially relevant for machine learning.
Friedman + pairwise comparisons with Holm correction.
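A minimal R sketch of this procedure, assuming a matrix acc of accuracies with one row per dataset and one column per algorithm, the first column being your algorithm (names and numbers are made up):

```r
set.seed(3)
acc <- matrix(runif(10 * 4, 0.7, 0.9), nrow = 10,
              dimnames = list(paste0("dataset", 1:10),
                              c("mine", "rivalA", "rivalB", "rivalC")))

# Omnibus test: do the algorithms differ at all? (rows = blocks, columns = groups)
friedman.test(acc)

# Pairwise: your algorithm against each rival (paired Wilcoxon), Holm-corrected
raw_p <- sapply(2:ncol(acc),
                function(j) wilcox.test(acc[, 1], acc[, j], paired = TRUE)$p.value)
names(raw_p) <- colnames(acc)[-1]
p.adjust(raw_p, method = "holm")
```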
4 Comparing a set of different algorithms (paired)
Friedman + Nemenyi = Demsar procedure
Further extensions of the Demsar procedure are implemented in the scmamp package in R.
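A sketch of the Demsar-style all-vs-all analysis with scmamp; plotCD is the critical-difference plot function as I remember it from that package, so treat the name and its arguments as an assumption and check the package documentation:

```r
# install.packages("scmamp")  # if not already installed
library(scmamp)

# acc as above: rows = datasets, columns = algorithms
friedman.test(acc)   # omnibus (base R; scmamp has its own variants too)

# Critical-difference diagram (Friedman + Nemenyi, Demsar-style).
# plotCD is an assumption from memory - verify against the scmamp docs.
plotCD(acc, alpha = 0.05)
```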
5 How to show that your classifier is better than the competition (especially for the ML students)
Comparing 2 groups - not a multiple comparison problem
Cross-validation - separate training and test subsets of the data
- holdout
- k-fold
- bootstrap
- and one can repeat each of them
The usual CV setups:
- 80/20 or 70/30 holdout
- 5 or 10-fold
- 100 repetitions of bootstrap
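A minimal base-R sketch of a k-fold split (here k = 5), with a placeholder "classifier" that just predicts the majority class so the example stays self-contained; in practice you would plug in your own model:

```r
set.seed(4)
d2 <- data.frame(x = rnorm(100),
                 y = factor(sample(c("a", "b"), 100, replace = TRUE)))

k <- 5
fold <- sample(rep(1:k, length.out = nrow(d2)))  # random fold assignment

acc_per_fold <- sapply(1:k, function(i) {
  train <- d2[fold != i, ]
  test  <- d2[fold == i, ]
  # Placeholder "classifier": always predict the majority class of the training set
  majority <- names(which.max(table(train$y)))
  mean(test$y == majority)  # accuracy on the held-out fold
})
acc_per_fold
```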
Let us assume a 5-fold.
- 5 test subsets (each fold)
- you could use a McNemar test (paired 0/1 data), but I have never seen anyone do that (maybe because McNemar tests are less powerful - last exercise!)
- you could measure the accuracy on the 5 test sets and compare the 5 numbers using Wilcoxon - but the number of data points is too low
- you could repeat the 5-fold 10 times and have 50 numbers to compare - this is possibly considered cheating
- there is a problem that the measures for each fold are not really independent - any two training sets share 3 of the 5 folds. You cannot trust statistical tests that assume the data are independent
- this paper proposes the correlated t-test to deal with the non-independence/correlation of the data, but I do not know of any paper that has used it (see the sketch after this list)
- this paper proposes 5 repetitions of a 2-fold CV (called 5x2cv) - you get 10 measures of accuracy - the author claims this balances the correlation problem against having enough data points
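Assuming the correlated t-test mentioned above is the corrected resampled t-test usually attributed to Nadeau and Bengio, it replaces the 1/n variance factor of the standard paired t-test with 1/n + n_test/n_train to compensate for the overlapping training sets; a sketch under that assumption, with made-up per-fold accuracy differences:

```r
# Corrected resampled t-test (assuming this is the "correlated t-test" meant above)
corrected_t_test <- function(diffs, n_train, n_test) {
  n  <- length(diffs)  # number of folds / repetitions
  m  <- mean(diffs)
  s2 <- var(diffs)
  # A standard paired t-test would use s2 / n; the correction inflates the
  # variance by n_test / n_train to account for overlapping training sets.
  t_stat <- m / sqrt((1 / n + n_test / n_train) * s2)
  p <- 2 * pt(-abs(t_stat), df = n - 1)
  c(t = t_stat, p.value = p)
}

# Example: accuracy differences (algorithm A - algorithm B) over 10 x 5-fold CV,
# with 80 training and 20 test examples per fold (all numbers made up)
set.seed(5)
diffs <- rnorm(50, mean = 0.01, sd = 0.02)
corrected_t_test(diffs, n_train = 80, n_test = 20)
```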