This is a blog for undergraduates grappling with stats!
Last week I was having a chat with an undergraduate student who was due to analyse some data. She was double-checking how to determine the statistical significance of her analysis. I mentioned that she could either use SPSS (which would provide the p-value directly), or obtain the t-value via hand-calculation and look up the critical value in the back of her textbook. Below is the type of table I was referring to; something all undergraduate students are familiar with. This one is for the t-test:
She then asked a very interesting question: “Where do all of these numbers come from, anyway?”. I tried my best to explain, but without any writing utensils to hand, I felt I couldn’t really do the answer justice. So, I thought I’d write a short blog-post for other curious students asking the same question: Where do these numbers come from?
A Simple Experiment
Let’s say a researcher is interested in whether alcohol affects response time (RT). She recruits 30 people into the lab, and tests their RT whilst sober. Then, she plies them with 4 pints of beer and tests their RT again. (Please note this is a poor design. Bear with me.) She finds that mean RT whilst sober was 460ms (SD = 63.63ms), and was 507ms (SD = 73.93ms) whilst drunk. The researcher performs a paired-samples t-test on the data, and finds t(29) = 2.646. Using the table above, she notes that in order for the effect to be significant at the 5% level (typically used in psychology), the t-value needs to exceed 2.043. As the observed value does exceed this “critical value”, she declares the effect significant (p<.05).
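For readers who want to see the hand-calculation spelled out, here is a minimal sketch of the paired-samples t statistic: the mean of the per-person differences divided by its standard error. The function name `paired_t` and the five pairs of RTs are invented purely for illustration; they are not the researcher's data and will not reproduce t(29) = 2.646.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t, df) for a paired-samples t-test on two equal-length lists."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample SD uses n - 1)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / se, n - 1

# Hypothetical RTs in ms for five participants (made up for this example)
sober = [450, 480, 430, 500, 470]
drunk = [470, 520, 445, 530, 515]

t_value, df = paired_t(sober, drunk)
print(f"t({df}) = {t_value:.3f}")
```

The degrees of freedom are the number of participants minus one, which is why 30 participants give 29 degrees of freedom.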
Students are familiar with this procedure, and perform it plenty of times during their studies. But how many have stopped to really ask “Why does the t-value need to exceed 2.043? Why not 2.243? Or 1.043?”.
A satisfactory answer requires an appreciation of what the p-value is trying to tell you. The correct definition of the p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one you have observed, if the null hypothesis is true. That is, if there is truly no effect of alcohol on RT, what is the probability that you would observe a t-value at least as extreme as 2.646 (the one obtained in the analysis)? The statistics table tells us that, with 30 subjects (and therefore 29 degrees of freedom), there is only a 5% chance of observing a t-value above 2.043 or below -2.043 when the null hypothesis is true. But where did this number come from?
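The definition above can be turned into a number directly. The sketch below is one possible way to do it using only Python's standard library: it integrates the density of the t-distribution with Simpson's rule and reports the area in both tails beyond the observed t-value. The function names `t_pdf` and `two_tailed_p` are my own, and the step count is an arbitrary choice; a statistics package would normally do this for you.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t_obs, df, steps=2000):
    """P(|T| >= |t_obs|) under the null hypothesis.

    Integrates the density from 0 to |t_obs| with Simpson's rule
    (steps must be even), doubles it by symmetry, and subtracts from 1.
    """
    b = abs(t_obs)
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # area between 0 and |t_obs|
    return 1 - 2 * central

print(f"p for t(29) = 2.646: {two_tailed_p(2.646, 29):.4f}")
print(f"p for t(29) = 2.043: {two_tailed_p(2.043, 29):.4f}")
```

The first p-value comes out below .05, which matches the researcher's conclusion, and the second lands very close to .05, which is what makes 2.043 (to within the rounding of the table) the critical value.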
The Power of Simulations
We can work out the answer to this question mathematically (and in fact this is often covered on statistics courses), but I think it is more powerful for students to see the answer via simulations. What we can do is simulate many experiments where we KNOW that the null hypothesis is true (because we can force the computer to make this so), and perform a t-test for each experiment. If we do this many times, we get a distribution of observed t-values when the null hypothesis is true.
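If you would like to try this yourself, here is a minimal sketch of such a simulation in plain Python. It is not the code used to make the figures in this post. It assumes a paired design (each simulated subject contributes a score in two conditions), with both conditions drawn from a normal distribution with mean 0 and SD 1 so that the null hypothesis is true by construction. The function name `null_t_values` and the seed are arbitrary.

```python
import math
import random
import statistics

def null_t_values(n_experiments, n_subjects, seed=1):
    """Simulate paired-samples experiments with no true effect; return their t-values."""
    rng = random.Random(seed)
    t_values = []
    for _ in range(n_experiments):
        # Both conditions come from the same distribution (mean 0, SD 1),
        # so any difference between them is pure sampling noise.
        diffs = [rng.gauss(0, 1) - rng.gauss(0, 1) for _ in range(n_subjects)]
        se = statistics.stdev(diffs) / math.sqrt(n_subjects)
        t_values.append(statistics.mean(diffs) / se)
    return t_values

t_values = null_t_values(n_experiments=300, n_subjects=30)
print(f"smallest t: {min(t_values):.2f}, largest t: {max(t_values):.2f}")
```

Plotting `t_values` as a histogram gives the distribution of t-values you should expect when nothing is going on.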
Animation of the Simulation
Below is a gif animation of this simulation collecting t-values. This simulation samples 30 subjects in two conditions, where the mean and standard deviation of each condition are fixed at 0 and 1, respectively. This gif only demonstrates the collation of t-values up to 300 experiments. The histogram shows the frequency of certain t-values as the number of experiments increases. The red vertical lines show the critical values for the t-value for 29 degrees of freedom.
Note that as the simulation develops, the bulk of the distribution of observed t-values falls within the critical values (i.e., they are contained within the limits defined by the red lines). In fact, in the long run, 95% of the distribution of t-values will fall within this window. This is where the critical value comes from! It marks the boundary that, in the long run, only 5% of t-values from null experiments will exceed in either direction.
A Larger Simulation
To show this, I repeated the simulation but now increased the number of experiments to 100,000. The histogram is below.
As before, the bulk of the distribution is contained within the critical t-value range. If we count exactly what percentage of these simulated experiments produced a t-value of 2.043 or greater (or less than -2.043), we see this value is 4.99%—just off the 5% promised by the textbooks! Therefore, 95.01% of the simulated t-distribution falls within the red lines.
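The counting step can be sketched as follows. Again, this is an illustration under my own assumptions (paired design, normal data with mean 0 and SD 1, an arbitrary seed), not the code behind the histogram, so the exact percentage will differ a little from 4.99%. It uses 20,000 experiments to keep the run time short in plain Python. The sketch also reads the critical value straight off the simulation by finding the value that 95% of the absolute t-values fall below.

```python
import math
import random

def simulate_abs_t(n_experiments, n_subjects, rng):
    """Absolute t-values from paired-samples experiments with no true effect."""
    abs_t = []
    for _ in range(n_experiments):
        diffs = [rng.gauss(0, 1) - rng.gauss(0, 1) for _ in range(n_subjects)]
        mean = sum(diffs) / n_subjects
        var = sum((d - mean) ** 2 for d in diffs) / (n_subjects - 1)
        abs_t.append(abs(mean) / math.sqrt(var / n_subjects))
    return abs_t

rng = random.Random(2024)  # arbitrary seed so the run is repeatable
abs_t = simulate_abs_t(20_000, 30, rng)  # raise to 100_000 to match the post (slower)

# Proportion of null experiments with t >= 2.043 or t <= -2.043
beyond = sum(t >= 2.043 for t in abs_t) / len(abs_t)

# Empirical critical value: the point that 95% of |t| values fall below
abs_t.sort()
empirical_critical = abs_t[int(0.95 * len(abs_t))]

print(f"proportion beyond +/-2.043: {beyond:.4f}")
print(f"empirical 5% critical value: {empirical_critical:.3f}")
```

With enough experiments the proportion settles near 5% and the empirical critical value settles near the number printed in the table.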
In this simulation, we repeated an experiment many times where the effect was known to be null. We found that 95% of the observed t-distribution fell within the range of -2.043 to 2.043. This is what the critical values are telling us. They mark the boundaries within which, in the long run, 95% of t-values will fall when there is no real effect. Therefore, so the argument goes, if you observe a more extreme value, this is reason to reject the null hypothesis.
The critical value changes depending on the degrees of freedom because the shape of the t-distribution under the null changes with the number of subjects in the experiment. For example, below is a histogram of null t-values in simulated experiments with 120 subjects. The textbooks tell us the critical value is 1.980. Therefore, we can predict that 95% of the distribution should fall within the window -1.980 to 1.980 (shown as the red lines below).
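To see how the critical value moves with the degrees of freedom without running a new simulation each time, one option is to work from the t-distribution's density directly. The sketch below (standard library only, with function names of my own choosing) searches by bisection for the cut-off c at which the area between -c and +c equals 95%. It assumes a paired design, so 30 subjects means 29 degrees of freedom and 120 subjects means 119.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))

def central_area(c, df, steps=2000):
    """Area under the t-density between -c and +c (Simpson's rule, even steps)."""
    h = c / steps
    total = t_pdf(0, df) + t_pdf(c, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3

def critical_value(df, alpha=0.05):
    """Two-tailed critical value: c such that the area in (-c, c) is 1 - alpha."""
    lo, hi = 0.0, 50.0
    for _ in range(60):  # bisection: halve the search interval each pass
        mid = (lo + hi) / 2
        if central_area(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for df in (9, 29, 119):
    print(f"df = {df:3d}: critical t = {critical_value(df):.3f}")
```

The critical value shrinks as the degrees of freedom grow: it comes out at about 2.045 for 29 degrees of freedom (close to the 2.043 quoted above) and about 1.980 for 119, because the tails of the t-distribution get thinner as the sample gets larger.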
That’s where the numbers come from!