18 Exercises
This course has emphasised viewing data as probabilistic. We have generated synthetic data from explicit mechanisms by simulating with random numbers, and used those samples to study the logic of statistical testing and analysis.
Testing involves the interplay of sample size, significance level, statistical power, and effect size, and we examined how probabilistic decisions emerge as functions of these.
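The interplay of these four quantities can be made concrete with a small simulation. The following is only a minimal sketch, not the course's own code. It assumes Python's standard library, a two-sided one-sample z-test with a known standard deviation of 1 (chosen so that no t-distribution is needed), and an illustrative effect size of 0.5. The function names `simulated_power` and `analytic_power` are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def simulated_power(n, effect_size, alpha=0.05, n_sims=4000, seed=1):
    """Share of simulated samples in which a two-sided z-test rejects H0: mu = 0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        # Data-generating mechanism: n draws from Normal(effect_size, 1)
        sample = [rng.gauss(effect_size, 1.0) for _ in range(n)]
        z = fmean(sample) * math.sqrt(n)  # test statistic with known sd = 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims


def analytic_power(n, effect_size, alpha=0.05):
    """Closed-form power of the same z-test, for comparison with the simulation."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n)
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)


# Power as a function of sample size for a fixed effect size and alpha
power_by_n = {n: simulated_power(n, effect_size=0.5) for n in (10, 20, 40, 80)}

# With no true effect, the rejection rate should sit near alpha
false_positive_rate = simulated_power(20, effect_size=0.0)

for n, power in power_by_n.items():
    print(n, round(power, 3), round(analytic_power(n, 0.5), 3))
print("rejection rate under H0:", round(false_positive_rate, 3))
```

Holding the effect size and significance level fixed while varying `n` shows power rising with sample size; setting the effect size to zero turns the same function into a check on the Type I error rate.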
The ability to generate synthetic data has also let us simulate questionable research practices (QRPs) and conduct sample-size planning. Linear models and their extensions can likewise be approached from the perspective of the data-generating mechanism.
We have surveyed the family of linear models (general linear models, generalised linear models, generalised linear mixed models, and hierarchical linear models), each designed to match the structure of its data more closely. More complex is not automatically better: a simpler model is preferable when it suffices, but applying a model that does not fit the data merely because it is simple is inappropriate.
As models grow more complex, estimation methods have moved from least squares to maximum likelihood to Bayesian estimation. All three estimate the same population parameters; none is intrinsically superior, and they should arrive at the same substantive conclusion. The choice can feel ideological, but as users we are free to pick whichever is most useful in practice.
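As a small illustration of the three approaches targeting the same parameter, the sketch below estimates the mean of normally distributed data in three ways. It is a toy example under stated assumptions, not a general recipe: the standard deviation is treated as known, the maximum-likelihood step is a crude grid search, and the Bayesian step uses a conjugate normal prior with mean 0 and standard deviation 10 chosen purely for illustration. All variable names are hypothetical.

```python
import math
import random
from statistics import fmean

rng = random.Random(42)
true_mu, sigma, n = 2.0, 1.0, 50  # illustrative values, sigma treated as known
data = [rng.gauss(true_mu, sigma) for _ in range(n)]

# 1. Least squares: the sample mean minimises the sum of squared deviations
ls_estimate = fmean(data)


# 2. Maximum likelihood: maximise the Gaussian log-likelihood over a grid of mu
def log_likelihood(mu):
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in data
    )


grid = [i / 1000 for i in range(0, 4001)]  # candidate values 0.000 to 4.000
ml_estimate = max(grid, key=log_likelihood)

# 3. Bayesian estimation: conjugate normal prior, closed-form posterior mean
prior_mean, prior_sd = 0.0, 10.0  # weakly informative prior (an assumption)
posterior_precision = 1 / prior_sd**2 + n / sigma**2
posterior_mean = (prior_mean / prior_sd**2 + sum(data) / sigma**2) / posterior_precision

print("least squares     :", round(ls_estimate, 3))
print("maximum likelihood:", round(ml_estimate, 3))
print("posterior mean    :", round(posterior_mean, 3))
```

With a weak prior and a moderate sample, the three estimates agree to within the grid resolution and a small amount of prior shrinkage, which is the sense in which the methods reach the same substantive conclusion here.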
For more advanced work we also touched on probabilistic programming languages, which let us write down data-generating mechanisms directly and have them yield parameter estimates.
Using a probabilistic programming language requires programming skill, an understanding of Bayesian statistics, and familiarity with MCMC theory and practice. Acquiring these skills, however, vastly broadens the range of analyses available to you. Rather than picking from a menu of pre-existing models, you can design your own. The degrees of freedom are essentially unlimited, and the creative process of imagining and constructing a model is genuinely enjoyable. Along the way you will also better understand how the existing canonical models are themselves designed.
MCMC represents a probability distribution as a collection of random draws. Whether through MCMC or other techniques, working with random samples to give probability a concrete form is the path to becoming fluent with statistical models. We should use our tools, not be used by them.
18.1 Final exercises
In a test of zero correlation, suppose the true correlation is \(\rho = 0.4\) and you draw a sample of size \(n = 20\). Plot, on the same axes, the sampling distribution under the null and the sampling distribution under the truth, and visualise the critical value at \(\alpha = 0.05\) and the resulting power. (Reference: 南風原 (2002), p. 144.)
In a two-factor between-subjects ANOVA design, write code that produces a synthetic dataset for which only the interaction is significant. Also include the `anovakun` analysis of the synthetic data to confirm it.
Write code that produces a dataset of three groups, each with two variables \(X\) and \(Y\), with a correlation of around \(r = -0.3\) within each group but a positive overall correlation when the groups are pooled. Visualise the data as a scatter plot colour-coded by group.
- Hint: generate regression-style data for each group with a constant slope \(\beta_1 = -0.3\), varying the intercept \(\beta_0\) across groups.
- Goal: demonstrate the importance of visualisation when interpreting correlations, and motivate the use of hierarchical linear models.
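The data-generating idea in the hint for the three-group exercise can be sketched as follows. This is one possible starting point rather than a model answer, written with Python's standard library only. The group means of \(X\), the intercepts, the within-group standard deviation of 1, and the residual standard deviation \(\sqrt{1 - 0.3^2}\) (which makes the within-group correlation equal to the slope) are all assumptions chosen for illustration, and `pearson_r` is a hypothetical helper. The scatter plot itself is left to the plotting library of your choice.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2024)
slope = -0.3                          # beta_1, constant across groups
n_per_group = 500
resid_sd = math.sqrt(1 - slope**2)    # gives within-group r of about -0.3 when sd(X) = 1

# Assumed group settings: X means and intercepts (beta_0) both rise across groups
group_x_means = {"A": 0.0, "B": 3.0, "C": 6.0}
intercepts = {"A": 0.0, "B": 3.9, "C": 7.8}

data = {}
for group, x_mean in group_x_means.items():
    xs = [rng.gauss(x_mean, 1.0) for _ in range(n_per_group)]
    ys = [intercepts[group] + slope * x + rng.gauss(0.0, resid_sd) for x in xs]
    data[group] = (xs, ys)

# Correlation within each group, then with all groups pooled
within_r = {group: pearson_r(xs, ys) for group, (xs, ys) in data.items()}
all_x = [x for xs, _ in data.values() for x in xs]
all_y = [y for _, ys in data.values() for y in ys]
pooled_r = pearson_r(all_x, all_y)

for group, r in within_r.items():
    print(group, round(r, 3))
print("pooled", round(pooled_r, 3))
```

Because the intercepts increase faster than the negative slope can pull \(Y\) down, the group centres line up along a rising diagonal, so the pooled correlation is positive even though every group trends downward.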