Statistical testing is a widely-used approach to the identification of features in a dataset that are associated with a metabolic regime, a disease susceptibility, a signalling defect. Such features tend to stand out and statistical testing is designed to identify things whose behaviour is somewhat out of the ordinary (ie, not easily explained by a null hypothesis \(H_0\)). Statistical testing plays a central role in any modern biology analysis. It is therefore important to understand how it works.
A \(p\) curve is the distribution of \(p\) values arising from using a given statistical model. The distribution of \(p\) values can be useful, for instance, in the evaluating the extent and consequences of \(p\) hacking in science (Head et al. PLoS Biol 2015).
The \(p\) curve under no effect (null hypothesis)
Suppose that \((X_1, \dots, X_{10})\) and \((Y_1, \dots, Y_{10})\) are two sets of independent, identically-distributed random variables \[X_i \sim \mathcal{N}(0, 1) \qquad Y_i \sim \mathcal{N}(0, 1)\] where \(\mathcal{N}(\mu, \sigma^2)\) refers to a normal distribution of mean \(\mu\) and variance \(\sigma^2\). The random variable \(P\) is the \(p\) value calculated using Student’s \(t\) test to evaluate whether there’s a significant difference between the set of \(X\)s and the set of \(Y\)s.
Using simulations, show how \(P\) is distributed.
Indication: rnorm()
can be used to draw random numbers from a normal distribution and t.test()
can be used to test for mean equality between two samples. Check out the documentation.
The \(p\) curve for a non-zero effect
Suppose that \((X_1, \dots, X_{10})\) and \((Y_1, \dots, Y_{10})\) are two sets of independent, identically-distributed random variables \[X_i \sim \mathcal{N}(0, 1) \qquad Y_i \sim \mathcal{N}(\delta, 1)\] Choose for instance \(\delta = .3\). The random variable \(P\) is the \(p\) value calculated using Student’s \(t\) test to evaluate whether there’s a significant difference between the set of \(X\)s and the set of \(Y\)s.
Using simulations, show how \(P\) is distributed.
The \(p\) curve under non-standard conditions
Suppose that \((X_1, \dots, X_{10})\) and \((Y_1, \dots, Y_{10})\) are two sets of independent, identically-distributed random variables \[X_i \sim \mathrm{Exp}(1) \qquad Y_i \sim \mathrm{Exp}(1)\] Here “Exp” refers to an exponential distribution. The random variable \(P\) is the \(p\) value calculated using Student’s \(t\) test to evaluate whether there’s a significant difference between the set of \(X\)s and the set of \(Y\)s.
Using simulations, show how \(P\) is distributed.
Indication: rexp()
can be used to draw random numbers from an exponential distribution.