Problems with p-values
```r
n = 10      # sample size per group
m = 1000    # number of simulated tests
a = matrix(rnorm(n*m), ncol=m)
b = matrix(rnorm(n*m), ncol=m)         # same distribution as a
same = NULL
for (i in 1:m) same = c(same, t.test(a[,i], b[,i])$p.value)

b1 = matrix(rnorm(n*m, 0.3), ncol=m)   # mean shifted by 0.3
dif1 = NULL
for (i in 1:m) dif1 = c(dif1, t.test(a[,i], b1[,i])$p.value)

b2 = matrix(rnorm(n*m, 0.5), ncol=m)   # mean shifted by 0.5
dif2 = NULL
for (i in 1:m) dif2 = c(dif2, t.test(a[,i], b2[,i])$p.value)
```
```r
sum(same < 0.05)/m   # fraction of false positives, close to 0.05 by construction
```
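The same simulation also shows the other side of the coin: how often the test rejects when the groups really do differ (the power). This summary is not in the original code; the sketch below mirrors the setup above in self-contained form, with an arbitrary seed:

```r
# Power of the two-sample t-test at n = 10 per group, for two effect sizes.
set.seed(3)
n = 10; m = 1000
a  = matrix(rnorm(n*m), ncol=m)
b1 = matrix(rnorm(n*m, 0.3), ncol=m)
b2 = matrix(rnorm(n*m, 0.5), ncol=m)
dif1 = sapply(1:m, function(i) t.test(a[,i], b1[,i])$p.value)
dif2 = sapply(1:m, function(i) t.test(a[,i], b2[,i])$p.value)
pow1 = sum(dif1 < 0.05)/m   # power against a 0.3 shift: low with n = 10
pow2 = sum(dif2 < 0.05)/m   # power against a 0.5 shift: higher but still modest
c(pow1, pow2)
```

With samples this small, even a real half-standard-deviation difference is missed most of the time, which is part of why chasing p < 0.05 is so tempting.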
Repeated testing by getting new data
```r
# collect a fresh dataset each time and stop at the first "significant" result
for (i in 1:m) {
  tt = t.test(a[,i], b[,i])$p.value
  print(paste(i, tt))
  if (tt < 0.05) break
}
```
Repeated testing by adding new data. This dilutes the problem rather than eliminating it: as the accumulated sample grows, early false positives become rarer, but peeking after every batch still gives more than a 5% overall chance of stopping on a false positive.
```r
# pool all data collected so far, re-test, and stop at the first "significant" result
for (i in 1:m) {
  newa = as.vector(a[,1:i])
  newb = as.vector(b[,1:i])
  tt = t.test(newa, newb)$p.value
  print(paste(i, tt))
  if (tt < 0.05) break
}
```
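A quick way to see how much (or how little) accumulating the data helps is to repeat the whole stop-at-first-significance procedure on many independent null datasets and count how often it ever stops. This check is not in the original notes, and the parameter choices (batch size, number of peeks, number of runs, seed) are arbitrary:

```r
# One run: accumulate batches of null data, peek after each batch,
# return TRUE if any peek crosses the 0.05 threshold.
run.once = function(n, batches) {
  a = matrix(rnorm(n*batches), ncol=batches)
  b = matrix(rnorm(n*batches), ncol=batches)
  for (i in 1:batches) {
    if (t.test(as.vector(a[,1:i]), as.vector(b[,1:i]))$p.value < 0.05) return(TRUE)
  }
  FALSE
}
set.seed(1)
fp.rate = mean(replicate(200, run.once(10, 50)))
fp.rate   # the overall false-positive rate, well above the nominal 0.05
```

Each individual peek has a 5% error rate, but fifty peeks on the same growing dataset add up to a much larger chance of eventually declaring a spurious difference.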
Repeated testing by measuring many attributes: a paper on how to achieve publishable results (in nutrition) using multiple attributes (and small sample sizes)
```r
natrib = 20   # number of attributes measured on each subject
ndata = 15    # sample size per group
a = matrix(rnorm(natrib*ndata), ncol=natrib)   # one column per attribute
b = matrix(rnorm(natrib*ndata), ncol=natrib)
for (i in 1:natrib) {
  tt = t.test(a[,i], b[,i])$p.value
  if (tt < 0.05) print(paste("attribute", i, "pvalue", tt))
}
```
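A standard guard against this, not covered in the notes above, is to adjust the whole family of p-values for multiplicity before declaring anything significant, for example with R's built-in `p.adjust`. A minimal sketch, with an arbitrary seed:

```r
# Collect all per-attribute p-values, then compare the raw count of
# "significant" attributes with the count after a Bonferroni correction.
set.seed(2)
natrib = 20; ndata = 15
a = matrix(rnorm(natrib*ndata), ncol=natrib)
b = matrix(rnorm(natrib*ndata), ncol=natrib)
pvals = sapply(1:natrib, function(i) t.test(a[,i], b[,i])$p.value)
praw = sum(pvals < 0.05)                                 # raw: chance hits possible
padj = sum(p.adjust(pvals, method="bonferroni") < 0.05)  # corrected: typically none survive
c(praw, padj)
```

Other `p.adjust` methods (e.g. `"BH"` for false discovery rate) trade off stringency against power, but all of them make it harder for a purely chance attribute to look publishable.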
Repeated testing by subset analysis: a recent case of error/fraud using subset analysis
```r
a = rnorm(100)
b = rnorm(100)
print(t.test(a, b)$p.value)      # the full-sample test finds no difference
print(rbinom(30, 1, prob=0.4))   # example of a random 0/1 subset indicator
# keep drawing random subsets until one comparison comes out "significant"
for (i in 1:200) {
  suba = rbinom(100, 1, prob=0.4)
  subb = rbinom(100, 1, prob=0.4)
  tt = t.test(a[suba==1], b[subb==1])$p.value
  print(paste(i, tt))
  if (tt < 0.05) break
}
```
Subgroup analysis may be very important for exploratory purposes: a drug that overall is no better than the control may work very well on men with high cholesterol, or on Asian women, etc.
A guideline on how to report subgroup analysis in medical papers
Repeated testing by further analysing the data and dropping outliers
```r
a = rnorm(100)
b = rnorm(100)
print(t.test(a, b)$p.value)   # the full-sample test finds no difference
suba = rep(1, 100)
subb = rep(1, 100)
# repeatedly drop a random ~5% of each group as "outliers" and re-test
for (i in 1:200) {
  suba = suba & rbinom(100, 1, prob=0.95)
  subb = subb & rbinom(100, 1, prob=0.95)
  tt = t.test(a[suba==1], b[subb==1])$p.value
  print(paste(i, tt, sum(suba), sum(subb)))
  if (tt < 0.05) break
}
```
In general the researcher makes choices about what to do: which data to drop, how to group the data, how to analyse it. Each choice on its own is reasonable, and he keeps making them until he gets the p-value he wants. Gelman and Loken call this the garden of forking paths: "The garden of forking paths: Why multiple comparisons can be a problem, even when there is no 'fishing expedition' or 'p-hacking' and the research hypothesis was posited ahead of time", by Andrew Gelman and Eric Loken.