Problems with p-values
```r
n = 10      # sample size per group
m = 1000    # number of simulated tests
a = matrix(rnorm(n*m), ncol=m)
b = matrix(rnorm(n*m), ncol=m)         # same distribution as a
same = NULL
for (i in 1:m) same = c(same, t.test(a[,i], b[,i])$p.value)

b1 = matrix(rnorm(n*m, 0.3), ncol=m)   # mean shifted by 0.3
dif1 = NULL
for (i in 1:m) dif1 = c(dif1, t.test(a[,i], b1[,i])$p.value)

b2 = matrix(rnorm(n*m, 0.5), ncol=m)   # mean shifted by 0.5
dif2 = NULL
for (i in 1:m) dif2 = c(dif2, t.test(a[,i], b2[,i])$p.value)
```
```r
sum(same < 0.05)/m   # fraction of false positives, close to 0.05 by construction
```
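The same simulation also shows the other side of the coin: how often the test rejects when the groups really do differ (the power). This summary is not in the original code; the sketch below mirrors the setup above in self-contained form, with an arbitrary seed:

```r
# Power of the two-sample t-test at n = 10 per group, for two effect sizes.
set.seed(3)
n = 10; m = 1000
a  = matrix(rnorm(n*m), ncol=m)
b1 = matrix(rnorm(n*m, 0.3), ncol=m)
b2 = matrix(rnorm(n*m, 0.5), ncol=m)
dif1 = sapply(1:m, function(i) t.test(a[,i], b1[,i])$p.value)
dif2 = sapply(1:m, function(i) t.test(a[,i], b2[,i])$p.value)
pow1 = sum(dif1 < 0.05)/m   # power against a 0.3 shift: low with n = 10
pow2 = sum(dif2 < 0.05)/m   # power against a 0.5 shift: higher but still modest
c(pow1, pow2)
```

With samples this small, even a real half-standard-deviation difference is missed most of the time, which is part of why chasing p < 0.05 is so tempting.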
Repeated testing by getting new data
```r
# collect a fresh dataset each time and stop at the first "significant" result
for (i in 1:m) {
  tt = t.test(a[,i], b[,i])$p.value
  print(paste(i, tt))
  if (tt < 0.05) break
}
```
Repeated testing by adding new data. This dilutes the problem rather than eliminating it: as the accumulated sample grows, early false positives become rarer, but peeking after every batch still gives more than a 5% overall chance of stopping on a false positive.
```r
# pool all data collected so far, re-test, and stop at the first "significant" result
for (i in 1:m) {
  newa = as.vector(a[,1:i])
  newb = as.vector(b[,1:i])
  tt = t.test(newa, newb)$p.value
  print(paste(i, tt))
  if (tt < 0.05) break
}
```
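A quick way to see how much (or how little) accumulating the data helps is to repeat the whole stop-at-first-significance procedure on many independent null datasets and count how often it ever stops. This check is not in the original notes, and the parameter choices (batch size, number of peeks, number of runs, seed) are arbitrary:

```r
# One run: accumulate batches of null data, peek after each batch,
# return TRUE if any peek crosses the 0.05 threshold.
run.once = function(n, batches) {
  a = matrix(rnorm(n*batches), ncol=batches)
  b = matrix(rnorm(n*batches), ncol=batches)
  for (i in 1:batches) {
    if (t.test(as.vector(a[,1:i]), as.vector(b[,1:i]))$p.value < 0.05) return(TRUE)
  }
  FALSE
}
set.seed(1)
fp.rate = mean(replicate(200, run.once(10, 50)))
fp.rate   # the overall false-positive rate, well above the nominal 0.05
```

Each individual peek has a 5% error rate, but fifty peeks on the same growing dataset add up to a much larger chance of eventually declaring a spurious difference.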
Repeated testing by measuring many attributes: a paper on how to achieve publishable results (in nutrition) using multiple attributes (and small sample sizes)
```r
natrib = 20   # number of attributes measured on each subject
ndata = 15    # sample size per group
a = matrix(rnorm(natrib*ndata), ncol=natrib)   # one column per attribute
b = matrix(rnorm(natrib*ndata), ncol=natrib)
for (i in 1:natrib) {
  tt = t.test(a[,i], b[,i])$p.value
  if (tt < 0.05) print(paste("attribute", i, "pvalue", tt))
}
```
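A standard guard against this, not covered in the notes above, is to adjust the whole family of p-values for multiplicity before declaring anything significant, for example with R's built-in `p.adjust`. A minimal sketch, with an arbitrary seed:

```r
# Collect all per-attribute p-values, then compare the raw count of
# "significant" attributes with the count after a Bonferroni correction.
set.seed(2)
natrib = 20; ndata = 15
a = matrix(rnorm(natrib*ndata), ncol=natrib)
b = matrix(rnorm(natrib*ndata), ncol=natrib)
pvals = sapply(1:natrib, function(i) t.test(a[,i], b[,i])$p.value)
praw = sum(pvals < 0.05)                                 # raw: chance hits possible
padj = sum(p.adjust(pvals, method="bonferroni") < 0.05)  # corrected: typically none survive
c(praw, padj)
```

Other `p.adjust` methods (e.g. `"BH"` for false discovery rate) trade off stringency against power, but all of them make it harder for a purely chance attribute to look publishable.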
Repeated testing by subset analysis: a recent case of error/fraud using subset analysis
```r
a = rnorm(100)
b = rnorm(100)
print(t.test(a, b)$p.value)      # the full-sample test finds no difference
print(rbinom(30, 1, prob=0.4))   # example of a random 0/1 subset indicator
# keep drawing random subsets until one comparison comes out "significant"
for (i in 1:200) {
  suba = rbinom(100, 1, prob=0.4)
  subb = rbinom(100, 1, prob=0.4)
  tt = t.test(a[suba==1], b[subb==1])$p.value
  print(paste(i, tt))
  if (tt < 0.05) break
}
```
Subgroup analysis may be very important for exploratory purposes: a drug that overall is no better than the control may work very well on men with high cholesterol, or on Asian women, etc.
A guideline on how to report subgroup analysis in medical papers
Repeated testing by further analysing the data and dropping outliers
```r
a = rnorm(100)
b = rnorm(100)
print(t.test(a, b)$p.value)   # the full-sample test finds no difference
suba = rep(1, 100)
subb = rep(1, 100)
# repeatedly drop a random ~5% of each group as "outliers" and re-test
for (i in 1:200) {
  suba = suba & rbinom(100, 1, prob=0.95)
  subb = subb & rbinom(100, 1, prob=0.95)
  tt = t.test(a[suba==1], b[subb==1])$p.value
  print(paste(i, tt, sum(suba), sum(subb)))
  if (tt < 0.05) break
}
```
In general the researcher makes choices about what to do: which data to drop, how to group the data, how to analyse it. Each choice on its own is reasonable, and he keeps making them until he gets the p-value he wants. Gelman and Loken call this the garden of forking paths: "The garden of forking paths: Why multiple comparisons can be a problem, even when there is no 'fishing expedition' or 'p-hacking' and the research hypothesis was posited ahead of time", by Andrew Gelman and Eric Loken.