Bayesian Estimation of Partial Correlations

Correlations are a popular analysis tool in psychology to examine the extent to which two variables are related. For example, one might be interested in whether there is a relationship between shoe size and height. But what if you want to explore the relationship between two variables whilst controlling for the effects of a third variable? For example, you might be interested in the relationship between shoe size and height whilst controlling for age. In order to do this, one needs to use partial correlations.

Whilst there are relatively straightforward methods to calculate partial correlations using frequentist statistics, I was interested recently whether there is a Bayesian parameter estimation version of this analysis. A cursory glance of the internet didn’t bring up much1, so I set about developing my own.

Below, I briefly cover the frequentist implementation before providing an overview of the Bayesian parameter estimation version.

Frequentist Implementation

Assume we have three variables, X, Y, and Z. We are interested in the relationship between X and Y, but need to account for the effect of Z. I simulated 75 data points (these could be participants in a study), and calculation of a standard Pearson’s correlation between X and Y provides rx,y(74) = 0.48, p<0.05. A “significant” relationship! However, this doesn’t take into account the effect of Z.

Examination of all possible correlations between X, Y, and Z reveals a strong relationship between Z and the other two variables:

  • rx,z(74) = 0.58, p<0.05.
  • ry,z(74) = 0.62, p<0.05.

Examining the relationship between X and Y whilst controlling for Z (denoted here as *rx,y|z*) is given by the following formula:

r_{x,y|z} = \frac{r_{x,y} - (r_{x,z} \times r_{y,z})}{\sqrt{1 - r_{x,z}^{2}} \times \sqrt{1 - r_{y,z}^{2}}}

When calculating this partial correlation between X and Y, controlling for Z, we get rx,y|z(74) = 0.19, p = .10. Note now that the relationship between X and Y is no longer “significant”. The Figure below shows the relationship with Regular (i.e., Pearson’s) and Partial correlation.

Bayesian Parameter Estimation

This is all well and good, but what if you want a Bayesian estimate of the partial correlation parameter? I had a quick check of the internet, and couldn’t see anything. (I should say that I didn’t spend long looking; I like to do some things from scratch just so I can learn something, and this was one of those times.) Inspired by the SUPERB book by Lee & Wagenmakers on Bayesian modelling, I devised a Bayesian graphical model of partial correlation:


This model extended the one presented in the Lee & Wagenmakers book which was used to calculate the correlation coefficient between just two variables. There are two extensions in the current model over that of Lee & Wagenmakers:

  • The model is extended to accommodate three variables of interest instead of two. As such, three correlation parameters are estimated from the data: rx,y, rx,z, and ry,z.2
  • The model has a new parameter, \theta , which denotes the partial correlation of interest in the current example, that is rx,y|z

The model assumes—as does the one by Lee & Wagenmakers—that the data are modelled as draws from a multivariate normal distribution. The parameters of this distribution are the means of each of the three variables (denoted \mu_{1}, \mu_{2},\mu_{3},  ) and their standard deviations (\sigma_{1}, \sigma_{2},\sigma_{3},  ), as well as the correlation coefficients that link them (r_{1,2}, r_{1,3}, r_{2,3},  ). I use broad, rather uninformative priors (which of course can be tweaked later if one wishes).

The parameter of interest, \theta , inherits the distributions from the model parameters pertaining to the correlation coefficients (r_{1,2}, r_{1,3}, r_{2,3},  ), and from them generates a new distribution of the partial correlation parameter (according to the first equation in this post). The distribution of interest for inference is now this new distribution pertaining to the partial correlation parameter.


Recall that the frequentist estimate of the partial correlation coefficient was rx,y|z(74) = 0.19. Below is a density plot of the posterior distribution of the \theta parameter from the Bayesian graphical model above3.

The mode of this posterior distribution was 0.205, with a 95% Highest-Density Interval spanning from -0.03 to 0.39. Note that whilst the modal estimate of the partial correlation parameter was close to that of the frequentist analysis, the Bayesian parameter estimation provides much more information, in particular regarding the uncertainty of this estimate.


I am not sure if this implementation is correct, so I advise using it with caution. But, it proved an interesting exercise, and I am not aware of a current implementation of this.


### initial set up
rm(list = ls())

# set working directory
setwd("D:/Work/Other/Blog_YouTube code/Blog/Bayesian Partial Correlation")

# load libraries

# set seed for reproducible code

### define functions

# function to simulate partially correlated data
get_data <- function(n){
  x <- rnorm(n, 0, 1)
  y <- .5 * x + rnorm(n, 0, 1)
  z <- .3 * x + .6 * y + rnorm(n, 0, 1)
  data <- data.frame(x = x, y = y, z = z)

### get data & conduct frequentist analysis

# generate the data
n <- 75
x <- get_data(n)

## plot the data with linear model fits
op = par(cex.main = 1.5, mar = c(5,6,4,5) + 0.1, mgp = c(3.5, 1,0), 
         cex.lab = 1.5, font.lab = 2, cex.axis = 1.3, bty = "n", las = 1)

# do pairs plot
pdf("pairs.pdf", width = 6, height = 6)
pairs(x, upper.panel = NULL, pch = 19)

# do correlation plot
pdf("correlation.pdf", width = 6, height = 6)
plot(x$y, x$x, pch = 17, ylab = "", xlab = "")

# model not controlling for z
mod_1 <- lm(x$x ~ x$y)
abline(a = mod_1$coefficients[1], b = mod_1$coefficients[2], lwd = 3, lty = 1, 
       col = "red")

# model controlling for z
mod_2 <- lm(x$x ~ x$y + x$z)
abline(a = mod_2$coefficients[1], b = mod_2$coefficients[2], lwd = 3, lty = 1, 
       col = "blue")
legend("bottomright", c("Regular", "Partial"), lty = c(1, 1), bty = "n", 
       lwd = 3, col = c("red","blue"), cex = 1.5)

# get the frequentist estimate of correlation & partial correlation
# note that the correlation between x and y is no longer significant when
# controlling for z
freq_r <- rcorr(as.matrix(x))
freq_partial_r <- pcor(x)

### Conduct Bayesian parameter estimation

# declare the JAGS model code
model_code <- "
model {

# data
for(i in 1:n){
x[i, 1:3] ~ dmnorm.vcov(mu[], TI[,])

# priors
mu[1] ~ dnorm(0, .001)
mu[2] ~ dnorm(0, .001)
mu[3] ~ dnorm(0, .001)

lambda[1] ~ dgamma(.001, .001)
lambda[2] ~ dgamma(.001, .001)
lambda[3] ~ dgamma(.001, .001)

r_xy ~ dunif(-1, 1)
r_xz ~ dunif(-1, 1)
r_yz ~ dunif(-1, 1)

# reparameterisation
sigma[1] <- 1/sqrt(lambda[1])
sigma[2] <- 1/sqrt(lambda[2])
sigma[3] <- 1/sqrt(lambda[3])

T[1, 1] <- 1 / lambda[1]
T[1, 2] <- r_xy * sigma[1] * sigma[2]
T[1, 3] <- r_xz * sigma[1] * sigma[3]

T[2, 1] <- r_xy * sigma[1] * sigma[2]
T[2, 2] <- 1 / lambda[2]
T[2, 3] <- r_yz * sigma[2] * sigma[3]

T[3, 1] <- r_xz * sigma[1] * sigma[3]
T[3, 2] <- r_yz * sigma[2] * sigma[3]
T[3, 3] <- 1 / lambda[3]

TI[1:3, 1:3] <- T[1:3, 1:3]

# partial correlation calculation
num <- r_xy - (r_xz * r_yz)
denom <- sqrt(1 - pow(r_xz, 2)) * sqrt(1 - pow(r_yz, 2))
partial <- num/denom

# model details
jags_info <- list("x", "n")
parameters <- c("r_xy", "r_xz", "r_yz", "mu", "sigma", "partial")

# fit the model
sample <- jags(jags_info, inits = NULL, parameters, 
               model.file = textConnection(model_code), n.chains = 1, 
               n.iter = 10000, n.burnin = 500, n.thin = 5, DIC = F)

# look at the overview of the parameter estimates

# extract the posterior samples of the partial correlation (4th column) &
# calculate the 95% HDI
posterior <- sample$BUGSoutput$sims.matrix[, 4]
sample_mcmc <- as.mcmc(posterior)
hdi <- HPDinterval(sample_mcmc)

###  plot

# do some preparation for plotting by finding the mode of the posterior
dens <- density(posterior)
posterior_mode <- dens$x[which.max(dens$y)]

# do the plot
pdf("bayesian_estimate.pdf", width = 6, height = 6)
plot(dens, xlim = c(-1, 1), ylim = c(0, max(dens$y) + 0.55), 
     main = "", xlab = "Partial Correlation Estimate", lwd = 2)

# add the mode of the sample & the HDI etc.
lines(x=c(posterior_mode, posterior_mode), y=c(2.5, 3.8), lty = 2, lwd = 2, 
      col = "red")
text(x= posterior_mode, y = max(dens$y) + 0.5, paste("Posterior mode =",
                                         round(posterior_mode, 3), sep = " "), 
     cex = 1.2, col = "red")
lines(x = c(hdi[1],hdi[1]), y = c(0,0.2), lwd = 2, col = "red")
lines(x = c(hdi[2],hdi[2]), y = c(0,0.2), lwd = 2, col = "red")
lines(x = c(hdi[1],hdi[2]), y = c(0.1,0.1), lwd = 2, col = "red")
text(x = (hdi[1] + hdi[2]) / 2, y = 0.325, "95% HDI", cex = 1.2, col = "red")

  1. Note that there is a paper from Wetzel & Wangenmakers (2012) which demonstrates the calculation of Bayes factors for correlation and partial correlation using summary statistics (i.e., Pearson’s r and the degrees of freedom). Note also (as I say in the main post) that I didn’t search that hard for a solution to this problem as I was keen to make my own method. So, there is probably a better way of doing this that is already available. 
  2. Note that in the model of Lee & Wagenmakers with just two variables, one must take the inverse of the variance–covariance matrix when passing it to the dmnorm function in JAGS. With three or more variables, there are issues when taking the inverse of the matrix because it is then not positive definite. (See this link for the help I received from the JAGS owner on this issue) The solution involves using JAGS versions newer than 4.3.0, and using the dmnorm.cov function instead. 
  3. As this is just a toy example, I kept the model-fit time as quick as possible. I generated 10,000 samples from the posterior distributions, treating the first 500 as burn-in samples. The thinning rate was set to 5, and just one chain was used. Note that this is not optimal, but the fit time was quite slow for the model. 

Low Power & Effect Sizes

Yesterday I posted the following tweet which has since turned out to be my most popular tweet EVER with hundreds of retweets and “likes” in 24 hours:

My motivation for the tweet was quite straightforward. I have recently been emailing academics in my department every week with different topics in an attempt to raise awareness of topics associated with increasing the information value of the research we are conducting as a department. This week’s topic was “Power”. In my email—in which I included a copy of Button et al.’s (2013) excellent paper on low-power in the neurosciences—I mentioned in passing that power is not just an issue for statistical significance. I have heard from people before that low power is only an issue when interpreting null results, and that if a study produces a significant outcome, then power is not an issue.

To pre-empt this response to my encouragement to increase the power of our studies, I said in my email: “Studies with low power and significant effects have been shown to over-estimate effect sizes, meaning your low-powered study—although significant—is not giving you precision.”

As soon as I sent the email, I realised that I couldn’t recall ever reading a study that had demonstrated this. Now, I knew that such a study (or studies) would have been conducted, but I realised that I had never actually read it myself. It turns out that such studies have indeed been conducted before, as people helpfully pointed out to me on Twitter in response to my tweet:

As I was unaware of these studies—plus it was a Sunday, and I was a little bored—I thought instead of doing a literature search I would code a simulation demonstrating the inflation of effect sizes with low-powered, significant, studies the results of which I emailed to my department to demonstrate that what I had said was indeed the case. Then I thought, “Well, I haven’t tweeted much this year, so why not put it on Twitter, too.”

The incredible engagement I have had with this tweet—I propose—is due to this being a rather under-appreciated fact. Indeed, I “knew” that low-powered studies over-estimate effect sizes, but I didn’t KNOW it in the sense that I had seen hard evidence for it.


Details of Simulation

Because my tweet was made in passing, I didn’t explain in much detail about the stimulation implementation. I discuss this here in case others want to extend the simulation in some way.

The effect size of interest is a measure of correlation between two measures. I arbitrarily chose IQ (mean = 100, SD  = 20) and response time (mean = 600ms, SD = 80ms). I fixed the “true” effect size to be r = 0.3. It turns out that to obtain 80% power for an r=0.3 requires 85 subjects. In my simulation, I wanted to explore a wide range of sample sizes, so chose the set 10, 20, 30, 50, 85, 170, 500, and 1000.

For each sample size—N—I simulated 1,000 “studies”. For each simulated study, the following procedure occurred:

  • Sample N draws from a multivariate normal distribution with the means and SD for IQ and RT as above and a population correlation coefficient of 0.3
  • Conduct a Pearson’s correlation between the two samples
  • If the correlation was significant, store the observed correlation coefficient in a new data frame
  • If the correlation was not significant, move on without storing anything
  • After 1,000 studies are completed, plot a boxplot of the observed effect sizes for N

The result was the image in the tweet.


Many limitations exist to this simulation, and I point interested readers to the material cited above in others’ tweets for a more formal solution. I didn’t intend for this to be a rigorous test, so it shouldn’t be taken too seriously; it was more for my own curiosity and also to provide a graphical image I could send to my colleagues at Keele to show the imprecision of effect sizes with low power. The particular outcomes are likely sensitive to my choice of means, SDs, r, etc. So, don’t generalise the specifics of this simulation, but maybe code your own tailored to your study of interest.

For me this was a bit of fun. Ten minutes of coding was time well spent on Sunday!

Solution to #BarBarPlots in R

I came across an interesting project the other day which is calling for a reconsideration of the use of bar plots (#barbarplots), with the lovely tag-line “Friends don’t let friends make bar plots!”. The project elegantly outlines convincing reasons why bar plots can be misleading, and have successfully funded a campaign to “…increase awareness of the limitations that bar plots have and the need for clear and complete data visualization”.

In this post, I want to show the limitations of bar plots that these scientists have highlighted. Then, I provide a solution to these limitations for researchers who want to continue using bar plots that can easily be cobbled together using R-statistics (with the ggplot2 package).

The Data

Say you are a researcher who collects some data (it doesn’t matter on what) from two independent groups and you are interested in whether there is a difference between them. Most researchers would maybe calculate the mean and standard error of each group to describe the data. Then the researcher might plot the data using a bar plot, together with error bars representing the standard error. To provide an inferential test on whether a difference exists, the researcher would usually conduct an independent samples t-test.

Let’s provide some example data for two conditions:

  • condition A (n = 100): mean of 200.17, a median of 196.43, and a standard error of 6.12
  • condition B (n = 100): mean of 200.11, a median of 197.87, and a standard error of 7.19

Here is the bar plot:


Pretty similar, right? The researcher sees that there is little evidence for a difference; to test this inferentially they conduct an independent samples t-test, with the outcome t(198) = 0.007, p = .995, Cohen’s d < 0.001. The researcher concludes there is no difference between the two groups.

The Problem

The problem raised by the #barbarplot campaign is that bar plots are a poor summary of the distribution of data. The bar plot above suggests there is no difference between the two groups, but the two groups are different! How do I know they are different? I simulated the data. What the bar plot hides is the shape of the underlying distribution of each data set. Below I present a density plot (basically a smoothed histogram) of the same data as above:


Now we can see that the two groups are clearly different! Condition A is a normal distribution, but condition B is bi-modal. The bar plot doesn’t capture this difference.

The Solution

Density plots are a nice solution to presenting the distribution of data, but can get really messy when there are multiple conditions (imagine the above density plot but with 4 or more overlapping conditions). Plus, researchers are used to looking at bar plots, so there is something to be said about continuing their use (especially for factorial designs). But how do we get around the problem highlighted by the #barbarplot campaign?

One solution is to plot the bar plots as usual, but to overlay the bar plot with individual data points. Doing this allows the reader to see the estimates of central tendency (i.e., to interpret the bar plot as usual), whilst at the same time allowing the reader to see the spread of data in each condition. This sounds tricky to do (and it probably is if you are still using Excel; yes, I’m talking to you!), but it’s simple if you’re using R.

Below is the above data plotted as a combined bar and point plot. As you can see, the difference in distribution is now immediately apparent, whilst retaining the advantages of a familiar bar plot. Everyone wins!



R Code

Below is the R code for the combined plot. This includes some code that generates the artificial data used in this example.

# load required packages

#--- Generate artificial data

# set random seed so example is reproducible

# generate condition A
condition <- rep("condition_A", 100)
dv_A <- rnorm(100, 200, 60)
condition_A <- data.frame(condition, dv = dv_A)

# generate condition B
condition <- rep("condition_B", 100)
dv_B <- c(rnorm(50, 130, 10), rnorm(50, 270, 10))
condition_B <- data.frame(condition, dv = dv_B)

# put all in one data frame
raw_data <- rbind(condition_A, condition_B)

# calculate sumary statistics
data_summary <- raw_data %>%
  group_by(condition) %>%
  summarise(mean = mean(dv), 
            median = median(dv),
            se = (sd(dv)) / sqrt(length(dv)))

#--- Do the "combined" bar plot
p2 <- ggplot()

# first draw the bar plot
p2 <- p2 + geom_bar(data = data_summary,
                    aes(y = mean,x = condition,
                        ymin = mean - se,
                        ymax = mean + se), fill = "darkgrey",
                    stat="identity", width=0.4)

# draw the error bars on the plot
p2 <- p2 + geom_errorbar(data = data_summary,
                         aes(y = mean, x = condition,
                             ymin = mean - se,
                             ymax = mean + se), stat = "identity",
                         width = 0.1, size = 1)

# now draw the points on the plot
p2 <- p2 + geom_point(data = raw_data, aes(y = dv, x = condition),
                      size = 3, alpha = 0.3,
                      position = position_jitter(width = 0.3, height = 0.1))

# scale and rename the axes, and make font size a bit bigger
p2 <- p2 + coord_cartesian(ylim = c(50, 400))
p2 <- p2 + scale_x_discrete(name = "Condition") +
  scale_y_continuous(name = "DV")

p2 <- p2 + theme(axis.text = element_text(size = 12),
                 axis.title = element_text(size = 14,face = "bold"))

# view the plot


“Bayesian in 8 Easy Steps” Journal Club

I’ve been trying to organise an online journal club to discuss the papers suggested in Alexander Etz and colleagues’ paper “How to become a Bayesian in 8 easy steps”. Several people have filled out the Doodle poll expressing an interest, but unfortunately not everyone can make the same time. As such, I am afraid I will have to go with the time which the majority of people can make. I am sorry that this will leave some people out.

The most popular day & time was Thursdays at 1pm UTC. Therefore, I propose the first meeting be on Thursday 10th March at 1pm. It will be on Google Hangouts, but I need to spend some time working out how to use this before I pass on details of the meet.

Please complete the following Doodle poll to indicate whether you can make this meeting or not. Please also provide your email address together with your name in the poll. Then, I can move the discussion to email rather than having to post on my blog for updates.

See you there!

(Pesky?) Priors

When I tell people I am learning Bayesian statistics, I tend to get one of two responses: either people look at me blankly—“What’s Bayesian statistics?”—or I get scorned for using such “loose” methods—“Bayesian analysis is too subjective!”1. This latter “concern” arises due to (what I believe to be a misunderstanding of) the prior: Bayesian analysis requires one state what one’s prior belief is about a certain effect, and then combine this with the data observed (i.e., the likelihood) to update one’s belief (the posterior).

On the face of it, it might seem odd for a scientific method to include “subjectivity” in its analysis. I certainly had this doubt when I first started learning it. (And, in order to be honest with myself, I still struggle with it sometimes.) But, the more I read, the more I think this concern is not warranted, as the prior is not really “subjectivity” in the strictest sense of the word at all: it is based on our current understanding of the effect we are interested in, which in turn is (often) based on data we have seen before. Yes, sometimes the prior can be a guess if we2 have no other information to go on, but we would express the uncertainty of a belief in the prior itself.

The more I understand Bayesian statistics, the more I appreciate the prior is essential. One under-stated side-effect of having priors is that it can protect you from dubious findings. For example, I have a very strong prior against UFO predictions; therefore, you are going to have to present me with a lot more evidence than some shaky video footage to convince me otherwise. You would not have to provide me with much evidence, however, if you claimed to have roast beef last night. Extraordinary claims require extraordinary evidence.

But, during my more sceptical hours, I often succumbed to the the-prior-is-nothing-but-subjectivity-poisoning-your-analysis story. However, I now believe that even if one is sceptical of the use of a prior, there are a few things to note:

  • If you are concerned your prior is wrong and is influencing your inferences, just collect more data: A poorly-specified prior will be washed away with sufficient data.

  • The prior isn’t (really) subjective because it would have to be justified to a sceptical audience. This requires (I suggest) plotting what the prior looks like so readers can familiarise themselves with your prior. Is it really subjective if I show you what my prior looks like and I can justify it?

  • Related to the above, the effect of the prior can be investigated using robustness checks, where one plots the posterior distribution based on a range of (plausible) prior values. If your conclusions don’t depend upon the exact prior used, what’s the problem?

  • Priors are not fixed. Once you have collected some data and have a posterior belief, if you wish to examine the effect further you can (and should) use the posterior from the previous study as your prior for the next study.

These are the points I mention to anti-Bayesians I encounter. In this blog I just wanted to skip over some of these with examples. This is selfish; it’s not really for your education (there really are better educators out there: My recommendation is Alex Etz’s excellent “Understanding Bayes” series, from where this blog post takes much inspiration!). I just want somewhere with all of this written down so next time someone criticises my interest in Bayesian analysis I can just reply: “Read my blog!”. (Please do inform me of any errors/misconceptions by leaving a comment!)

As some readers might not be massively familiar with these issues, I try to highlight some of the characteristics of the prior below. In all of these examples, I will use the standard Bayesian “introductory tool” of assessing the degree of bias in a coin by observing a series of flips.

A Fair Coin

If a coin is unbiased, it should produce roughly equal heads and tails. However, often we don’t know whether a coin is biased or not. We wish to estimate the bias in the coin (denoted theta) by collecting some data (i.e., by flipping the coin); a fair coin has a theta = 0.5. Based on this data, we can calculate the likelihood of various theta values. Below is the likelihood function for a fair coin.


In this example, we flipped the coin 100 times, and observed 50 heads and 50 tails. Note how the peak of the likelihood is centered on theta = 0.5. A biased coin would have a true theta not equal to 0.5; theta closer to zero would reflect a bias towards tails, and a theta closer to 1 would reflect a bias towards heads. The animation below demonstrates how the likelihood changes as the number of observed heads (out of 100 flips) increases:


So, the likelihood contains the information provided by our sample about the true value for theta.

The Prior

Before collecting data, Bayesian analysts would specify what their prior belief was about theta. Below I present various priors a Bayesian may have using the beta distribution (which has two parameters: a and b):betas-1

The upper left plot reflects a prior belief that the coin is fair (i.e., the peak of the distribution is centered over theta = 0.5); however, there is some uncertainty in this prior as the distribution has some spread. The upper right plot reflects total uncertainty in a prior belief: that is, the prior holds that any value of theta is likely. The lower two plots reflect prior beliefs that the coin is biased. Maybe the researcher had obtained the coin from a known con-artist. The lower left plot reflects a prior for a biased coin, but uncertainty about which side the coin is biased towards (that is, it could be biased heads or tails); the lower right plot reflects a prior that the coin is biased towards heads.

The effect of the prior

I stated above that one of the benefits of the prior is that it allows protection (somewhat) from spurious findings. If I have a really strong prior belief that the coin is fair, 9/10 heads isn’t going to be all that convincing evidence that it is not fair. However, if I have a weak prior that the coin is fair, then I will be quite convinced by the data.

This is illustrated below. Both priors below reflect the belief that the coin is fair; what differs between the two is the strength in this belief. The prior on the left is quite a weak belief, as the distribution (although peaked at 0.5) is quite spread out. The prior on the right is a stronger belief that the coin is fair.

In both cases, the likelihood is the result of observing 9/10 heads.


You can see that when the prior is a weak belief, the posterior is very similar to the likelihood; that is, the posterior belief is almost entirely dictated by the data. However, when we have a strong prior belief, our beliefs are not altered much by observing just 9/10 heads.

Now, I imagine that this is the anti-Bayesian’s point: “Even with clear data you haven’t changed your mind.” True. Is this a negative? Well, imagine instead this study was assessing the existence of UFOs rather than simple coin flips. If I showed you 9 YouTube videos of UFO “evidence”, and 1 video showing little (if any) evidence, would you be convinced of UFOs? I doubt it. You were the right-hand plot in this case. (I know, I know, the theta distribution doesn’t make sense in this case, but ignore that!)

What if the prior is wrong?

Worried that your prior is wrong3, or that you cannot justify it completely? Throw more data at it. (When is this ever a bad idea?) Below are the same priors, but now we flip the coin 1,000 times and observe 900 heads. (Note, the proportion heads is the same in the previous example.) Now, even our strong prior belief has to be updated considerably based on this data. With more data, even mis-specified priors do not affect inference.


To get an idea of how sample size influences the effect of the prior on the posterior, I created the below gif animation. In it, we have a relatively strong (although not insanely so) prior belief that the coin is biased “heads”. Then, we start flipping the coin, and update the posterior after each flip. In fact, this coin is fair, so our prior is not in accord with (unobservable) “reality”. As flips increases, though, our posterior starts to match the likelihood in the data. So, “wrong” priors aren’t really a problem. Just throw more data at it.


“Today’s posterior is tomorrow’s prior” — Lindley (1970)

After collecting some data and updating your prior, you now have a posterior belief of something. If you wish to collect more data, you do not use your original prior (because it no longer reflects your belief), but you instead use the posterior from your previous study as the prior for your current one. Then, you collect some data, update your priors into your posteriors…and so on.

In this sense, Bayesian analysis is ultimately “self-correcting”: as you collect more and more data, even horrendously-specified priors won’t matter.

In the example below, we have a fairly-loose idea that the coin is fair—i.e., theta = 0.5. We flip a coin 20 times, and observe 18 heads. Then we update to our posterior, which suggests the true value for theta is about 0.7 ish. But then we wish to run a second “study”; we use the posterior from study 1 as our prior for study 2. We again observe 18 heads out of 20 flips, and update accordingly.



One of the nicest things about Bayesian analysis is that the way our beliefs should be updated in the face of incoming data is clearly (and logically) specified. Many peoples’ concerns surround the prior. I hope I have shed some light on why I do not consider this to be a problem. Even if the prior isn’t something that should be “overcome” with lots of data, it is reassuring to know for the anti–Bayesian that with sufficient data, it doesn’t really matter much.

So, stop whining about Bayesian analysis, and go collect more data. Always, more data.

Click here for the R code for this post

  1. Occasionally (althought his is happening more and more) I get a slow, wise, agreeing nod. I like those. 
  2. I really wanted to avoid the term “we””, as it implies I am part of the “in-group”: those experts of Bayesian analysis who truly appreciate all of its beauty, and are able to apply it to all of their experimental data. I most certainly do not fall into this camp; but I am trying. 
  3. Technically, it cannot be “wrong” because it is your belief. If that belief is justifiable, then it’s all-game. You may have to update your prior though if considerable data contradict it. But, bear with me. 

Statistics Tables: Where do the Numbers Come From?

This is a blog for undergraduates grappling with stats!

Last week I was having a chat with an undegraduate student who was due to analyse some data. She was double-checking how to determine the statistical significance of her analysis. I mentioned that she could either use SPSS (which would provide the value directly), or obtain the t-value via hand-calculation and look up the critical value in the back of her textbooks. Below is the type of table I was referring to; something all undergraduate students are familiar with. This one is for the t-test:


She then asked a very interesting question: “Where do all of these numbers come from, anyway?”. I tried my best to explain, but without any writing utensils to hand, I felt I couldn’t really do the answer justice. So, I thought I’d write a short blog-post for other curious students asking the same question: Where do these numbers come from?

A Simple Experiment

Let’s say a researcher is interested in whether alcohol affects response time (RT). She recruits 30 people into the lab, and tests their RT whilst sober. Then, she plies them with 4 pints of beer and tests their RT again. (Please note this is a poor design. Bear with me.) She finds that mean RT whilst sober was 460ms (SD = 63.63ms), and was 507ms (SD = 73.93ms) whilst drunk. The researcher performs a paired-samples t-test on the data, and finds t(29) = 2.646. Using the table above, she notes that in order for the effect to be significant at the 5% level (typically used in psychology), the t-value needs to exceed 2.043. As the observed value does exceed this “critical value”, she declares the effect significant (p<.05).

Students are familiar with this procedure, and perform it plenty of times during their study. But how many have stopped to really ask “Why does the t-value need to exceed 2.043? Why not 2.243? Or 1.043?”.

A satisfctory answer requires an appreciation of what the p-value is trying to inform you. The correct definition of the p-value is the probability of observing a test statistic as extreme—or more extreme—as the one you have observed, if the null hypothesis is true. That is, if there is truly no effect of alcohol on RT performance, what is the probability you would observe a t-value equal or higher to 2.646 (the one obtained in analysis)? The statistics table tells us that—with 30 subjects (therefore 29 degrees of freedom)—that there is only a 5% chance of observing a t-value above 2.043 (or below -2.043). But where did this number come from?

The Power of Simulations

We can work out the answer to this question mathematically (and in fact this is often covered on statistics courses), but I think it is more powerful for students to see the answer via simulations. What we can do is simulate many experiments where we KNOW that the null hypothesis is true (because we can force the computer to make this so), and perform a t-test for each experiment. If we do this many times, we get a distribution of observed t-values when the null hypothesis is true.

Animation of the Simulation

Below is a gif animation of this simulation collecting t-values. This simulation samples 30 subjects in two conditions, where the mean and standard deviation of each condition is fixed at 0 and 1, respectively. This gif only demonstrates the collation of t-values up to 300 experiments. The histogram shows the frequency of certain t-values as the number of experiments increases. The red vertical lines show the critical values for the t-value for 29 degrees of freedom.


Note that as the simulation develops, the bulk of the distribution of observed t-values falls within the critical values (i.e., they are contained within the limits defined by the red lines). In fact, in the long-run, 95% of the distribution of t-values will fall within this window. This is where the critical value comes from! It’s the value for which, in the long-run, 95% of t-values from null experiments will fall below.

A Larger Simulation

To show this, I repeated the simulation but now increased the number of experiments to 100,000. The histogram is below.


As before, the bulk of the distribution is contained within the critical t-value range. If we count exactly what percentage of these simulated experiments produced a t-value of 2.043 or greater (or less than -2.043), we see this value is 4.99%—just off the 5% promised by the textbooks! Therefore, 95.01% of the simulated t-distribution falls within the red lines.

Summing Up

In this simulation, we repeated an experiment many times where the effect was known to be null. We found that 95% of the observed t-distribution fell within the range of -2.043 to 2.043. This is what the critical values are telling us. They are the t-values for which, in the long run, 95% of t-values will be less extreme than when there is no real effect. Therefore, so the argument goes, if you observe a more extreme value, this is reason to reject the null hypothesis.

The critical value changes depending on the degrees of freedom because the shape of the t-distribution under the null changes with the number of subjects in the experiment. For example, below is a histogram of null t-values in simulated experiments with 120 subjects. The textbooks tell us the critical value is 1.980. Therefore, we can predict that 95% of the distribution should fall within the window -1.980 to 1.980 (shown as the red lines below).


That’s where the numbers come from!

Recruitment Order & Sequential Bayes Factors

One advantage of using Bayes factors (or any other form of Bayesian analysis) is the ability to engage in optional stopping. That is, one can collect some data, perform the critical Bayesian test, and stop data collection once a pre-defined criterion has been obtained (e.g., until “strong” evidence has been found in support of one hypothesis over another). If the criterion has not been met, one can resume data collection until it has.

Optional stopping is problematic when using p-values, but is “No problem for Bayesians!” (Rouder, 2014; see also this blog post from Kruschke). A recent paper in press at Psychological Methods (Shoenbrodt et al., in press) shows you how to use optional stopping with Bayes factors—so-called “sequential Bayes factors” (SBF).

The figure below a recent set of data from my lab where we used SBF for optional stopping. Following the advice of the SBF paper, we set our stopping criteria as N > 20 and BF10 > 6 (evidence in favour of the alternative) or BF10 < 1/6 (evidence in favour of the null); these criteria are shown as horizontal dashed lines. The circles represent the progression of the Bayes factor (shown in log scale) as the sample size increased. We stopped at 76 participants, with “strong” evidence supporting the alternative

One thing troubled me when I was writing these results up: I could well imagine a skeptical reviewer noticing that the Bayes factor seemed to be in support of the null between sample sizes of 30 and 40, although the criterion for stopping was not reached. A reviewer might wonder, therefore, whether had some of the later participants who showed no effect (see e.g., participants 54–59) been recruited earlier, would we have stopped data collection and found evidence in favour of the null? Put another way, how robust was our final result to the order in which our sample were recruited to the experiment?

Assessing the Effect of Recruitment Order

I was not aware of any formal way to assess this question, so for the purposes of the paper I was writing I knocked up a quick simulation. I randomised the recruitment order of our sample, and plotted the sequential Bayes factor for this “new” recruitment order. I performed this 50 times, plotting the trajectory of the SBF. I reasoned that if our conclusion was robust against recruitment order, the stopping rule in favour of the null should be rarely (if ever) met. The results of this simulation are below. As can be seen, the stopping criterion for the null was never met, suggesting our results were robust to recruitment order.



Recruitment Order Simulation

Then, I became curious: How robust are SBFs to recruitment order when we know what the true effect is? I simulated a set of 100 participants for an effect known to be small (d = 0.3), and plotted the SBF as “participants” were recruited. I tried different random seeds until I found a pattern I was interested in: I wanted a set of data where the SBF comes very close to the stopping rule in favour of the null, even though we know the alternative is true. It took me a few tries, but eventually I got the plot below.bfSim

Note the dip toward the null criterion is even more extreme than in my experiment. Thus, this set of data seemed perfect to test my method for assessing the robustness of the final conclusion (“strong evidence for alternative”) against recruitment order. I followed the same procedure as before, simulating 50 random recruitment orders, and plotted the results below.orderSim

The null criterion is only met in one simulated “experiment”. This seems to be very pleasing, at least for the paper I am due to submit soon: the SBF (at least in these two small examples) seem robust against recruitment order.

R Code

If you would like to see the R code for the above plots, then you can download the html file for this post which includes R code.

Polls: Why sample size matters

The people of Scotland are currently deciding in an upcoming “Yes-No” referendum (this Thursday!) whether to leave the British Union and become an independent nation. As is typical around times of heightened political interest, media outlets are awash with polls attempting to predict the outcome.  Recent polls have put the “No” response at about 52%, although this is reported to be reducing. 

What frustrates me about the reporting of these polls is that quite often—always?—the media do not inform us the sample size of the poll. It’s all well and good stating the “No” campaign has a 52% lead, but I don’t care about the point estimate (i.e., 52%); I care about how much error there is in this estimate, and—as a consequence—how this influences my confidence in the estimate of 52%. Large error means the point estimate is next-to useless; small error means it is likely more believable. The degree of error of the estimate tends to reduce as sample size increases, so without this information, I cannot assess how much error there is likely to be in the estimate.

Sample Statistics & Credible Intervals

In statistics, we are interested in what’s called a population parameter, which (roughly) corresponds to the “true” value of a test of interest if we were able to sample the whole population. This is certainly what pollsters are interested in: they want to predict the outcome of the referendum.

Of course, this is never possible, so we take samples from the main population, and on the basis of the results obtained from this sample, estimate the true population parameter. Thus, our interest is in the population parameter, and we only use the sample parameter as a proxy to estimate what really interests us. The sample parameter is our point estimate from our sample, and we hope that it is representative of the true population parameter.

It’s very unlikely that the sample estimate exactly matches the population parameter, so ideally we would like to see a range of credible estimates of the true population parameter (a credible interval), rather than just a point estimate. For example, how confident can we be that the “true” result will be 52% given our sample responded 52%? How much more likely is 51% to be a true estimate? What about 49% (which would completely change the result of the referendum)? It’s impossible to tell from a point estimate. 

I wanted to demonstrate how the size of the sample used can influence our confidence in the point estimates presented in these polls. To do this, I used R and Bayesian estimation to run some quick simulations of Yes/No polls whilst varying the total number of “people” asked. What I was interested in was how the confidence in the estimation of the population parameter varied as sample size increased. 

The results are shown below. If you would like more detail about the simulation (and Bayesian estimation techniques), please see the end of the post. Disclaimer: I am no Bayesian expert, so if there are errors below I would love to hear about them. The message remains the same, though, even if the details might differ slightly.


The results for each sample size are shown in the Figure below. The sample size is shown at the top of each plot (“N = sample size”). polls


These plots show probability density functions across all possible parameters (proportion = N[o], on the x-axis). It shows how we should assign credibility across the parameter space, given our prior beliefs and our data (see simulation details for these). Parameters that fall within the “bulk” of the distribution are far more likely to be the “true” parameter than parameters that fall outside the bulk of the distribution. Thus, Bayesian analysis provides us with a way to assess the range of credible parameters by finding those values which 95% of the density falls over. This is shown as red line in each of the above plots, and is known as the 95% Highest-Density Interval (HDI). 

Values that fall within the HDI are more credible estimates of the true population parameter than those values that fall outside of it. Narrow HDIs suggest we have been able to provide a very precise estimate of the true parameter; wide HDIs reflect uncertainty in our estimates of the true parameter. 

Remember, the value I used to simulate this data was Proportion N[o] = .52 (52%). We see that, for all sample sizes, the median of the distribution (i.e., the peak of the distribution) is roughly above this figure. Thus, the “point estimate” for most of these plots would coincide with the pollsters with 52%.

But, look at the width of the density function (and the red HDI) as sample size increases. For only 50 people being polled, the HDI runs from 0.38 to 0.68 (see top of figure for numbers). This means that our best estimate of the percentage of people saying “NO” ranges from 38% (landslide for the Yes campaign) to 65% (landslide for the No campaign). I’m not sure about you, but I’m not very confident that 52% is a very good estimate here, as there is such wild uncertainty. 

On the contrary, with 2000 respondents, the HDI is much more narrow (because the density is much narrower); this reflects a very precise estimate of the population parameter; but even then, there is a 4% window of credible outcomes. That is, the true response could be anywhere from 50% (ambivalent) to 54% (clear edge to No campaign). 

Take-Home Message

Most, if not all, of the polls being conducted are sufficiently sampled. What gets my goat though is not the polls, but how they are reported. And this is not just true for this campaign. How many times have you seen an advert stating “…58% of people agree…”. My response is always the same: Without information on the sample size, I just don’t trust the percentage. 

Simulation Overview

If you would like the R-code I used, please shoot me an email! It uses the scripts presented in Lee & Wagenmakers’ (2014) SUPERB book “Bayesian Cognitive Modelling”. 

  1. I am interested in estimating the proportion of people responding “No”; I fixed this at the most recent estimate of 52%. Therefore, 52% is the “true” population parameter we are trying to estimate, and I simulated data using this percentage.
  2. I varied the sample size of the simulated poll from 50 (ridiculously small) up to 2000 (pretty large) in various steps. In each simulated sample size, I fixed it so that 52% responded “No”.  
  3. I used Bayesian parameter estimation to determine—for each sample size—our best estimate of the population parameter. Beyesian estimation provides a complete probability distribution over the entire parameter space (called the posterior distribution), which basically shows how we should assign credibility across all of the possible parameter values , given our prior expectations and given the data. Wide posterior distributions suggest a large number of possible parameters are candidates to be the “true” parameter, whereas narrow posterior distributions suggest we have narrowed down the possibilities considerably. 
  4.  The prior is how much probability we provided to each parameter value before we saw any data. In our example, we need to provide a probability distribution over all possible percentages of responding “No” based on our expectations. For the simulation, I set the prior to be a uniform distribution over all parameters (a beta[1,1] for those interested; that is, the prior was very neutral in terms of our expectations. 
  5. Then, I set the responses to 52% of the sample size. This is our data. I modelled responding as being a binomial process, with k = proportion responding “No”, and n = total sample size. 
  6. Based on our prior and data, I used rjags to conduct the analysis, which produces one probability density function across all possible parameter values for each sample size. This is shown in the Figure presented. 

How Poker Imitates Academia: The case of Expected Value

The summer couldn’t come too quickly for me this year. After a very busy spring/summer semester, I couldn’t wait to go on my holidays. I was fortunate enough to spend a week on the Caribbean, and it was  just what this doctor ordered.

During my stay, I noticed that the hotel had a daily poker tournament. I had dabbled with poker before, as my brother was (and now is again) a very keen player. I’ve played in live tournaments, and even finished in the top-10 of a European online tournament a few years ago. I know my way around a pack of cards; needless to say, I signed up quickly. Below is a photo of me (with the cap) on my way to tournament victory at the resort. 


Poker is skill, NOT luck

Upon my return from holiday, I decided to read as much as I can about poker theory, and it’s been a blast. Contrary to what most people believe, poker is a game of skill, not luck. Every year, the same players top the “best player” lists, and sit on final tables of massive tournaments; this would not be possible if poker were primarily a luck game. For a statistic nerd like me, I relish in the fact that it’s all about understanding the numbers. You have to understand probabilities of certain hands occurring, odds of certain cards appearing in the future, and pitting these against how much money you can win on the current hand (so-called “pot-odds”).

The beauty of poker is that if you have a good understanding of these odds and probabilities (and can narrow down the potential cards your opponents have), you can pretty much guarantee that you can play profitably—at least at the lower stakes.  The key is to think long-term, not short-term.  When deciding whether to put all of your chips in to call someone else’s bet, you have to weigh up the utility (or profitability) of that bet over the long-term (i.e., over several occurrences) rather than worry what will happen on this specific hand. Results can vary wildly from hand-to-hand, but probabilities are probabilities, and in the long-term everything will settle down to this probability (over hundreds of hands, for example).

For example, here is a hand I played on the way to winning the above tournament in Mexico. I looked down and had the following two cards:


Now, these are not great starting cards in no-limit Texas hold ’em, but I decided to play the hand anyway. A fellow competitor looked at his cards and all of a sudden looked very excited, before putting some chips in the middle. Based on his excitement, I believed he had AA, KK, QQ, or a high connector like AK, AQ, AJ. The flop and turn appeared with no further betting: 


At this stage, there were 100 chips in the pot, and there was only me and one other guy in play. At this point, he put 25 chips into the pot, making it 125 chips. I have to pay 25 chips to continue, but at this stage, I’m pretty sure he has got at least a pair of Kings, and therefore has me beat. Even worse, before the hand, I thought he had KK, which would mean he now has three Kings. The only thing that can save me is to get a card with a heart on the final card, making me an A-high flush, which beats his trip Kings.

But what are the odds of this happening? With only one more card to be drawn, the odds of this happening are 4.11:1 against (or a probability of ~.19). Should I make the call? How can I work out whether this is profitable? 

The beauty of poker is that you make your decision NOT on what you THINK will happen on THIS hand; this is what gamblers do. They decide whether they feel “like a punt” and put their chips in the middle if they feel it’s their lucky day. Poker players don’t do this (at least, profitable ones don’t). They use something called expected value (EV), which is basically a statistic for how much one can expect to profit if one were to repeat the trial many times. That is, given how much I stand to win (or lose) and the probability of me winning (or losing), how much would I expect to profit over the long term if I were to repeat this scenario many times. It even has a nice formula:

EV = (pW * $W) – (pL *$L)

where pX is the probability of winning (W) or losing (L), and $X is how much I stand to win (W) or lose (L). So, my EV for the above betting proposition is

EV = (.19 * 125) – (.81 * 25) = +3.50

So, this IS a profitable call despite my somewhat-long odds of actually getting a heart on the final card. Therefore, I should call. Over the long-term, repeating this call in identical situations will always be profitable for me. In this case, however, I missed a heart on the final card, he flipped over KK (3 Kings total), and won the pot.


What the hell does this have to with academia?

What matters in the above example is thinking of the utility of actions over the long-term, rather than getting distracted by short-term results. I lost this hand (and a reasonable proportion of my chip stack); does this mean I shouldn’t have made the call? NO! Over the long-term, this is always the correct play; the short-term result (losing) will always be washed out in the long-term.

You can think of the hand-to-hand results as noise, and the EV as the signal. Calling is always a good idea in this situation, but not every hand will win every time. Profitable poker players know this, and don’t get (too) upset at losing these individual hands, because in the long-term, they know it was a correct play that will profit in the long-term.

As I was reading about EV and the idea of “taking the long-view” that this is much like paper rejections in academia. The “play” in this case is all the hard work and dedication that goes into a piece of research, and the “result” is a win (accept!) or a loss (reject!). Sometimes, p(win) is actually very low (as was the case in my example); many journals have an acceptance rate of ~20%, which actually matches the probability of my hand improving to the winning one above (this was purely coincidental). But, the potential reward ($W) is actually very high (you know how good it feels to get a paper accepted, let alone the career benefits they bring!), so the play is always correct. Work hard, keep submitting, and in the long-term, your publishing EV will be positive, despite short-term set-backs. You can expect these set-backs often (almost 80% of the time), but like the poker player who doesn’t grumble about short-term losses (like mine above), keep putting your chips in.

To summarise:

  1. Think long-term, not short-term: Don’t be too despondent about individual rejections.
  2. Don’t be put off by low-probability of success if potential reward is high
  3. Keep making plays (keep working hard!).

Now, go back to your lab, and shuffle up and deal.

Friedman’s test with tied data

I completely sympathise with students sometimes when they say they hate statistics. I really do. A colleague emailed me the other day in response to a student query regarding difficulty in getting their hand-calculation for a Friedman test to match that produced by SPSS (shout-out to this eagle-eyed student!!). I smiled to myself, thinking “silly SPSS, giving the wrong results again!”. Then, in a flurry of smugness, I fired up R, entered the data my colleague had provided, and found to my utter surprise that R, too, was producing the “incorrect” result. I checked the hand calculations in Excel and they were all fine. So how come R and SPSS were giving different (apparently incorrect) results?

Here are the data (including the ranks):




The equation used in the hand-calculations was

reg Friedman


where n is the number of observations per condition (12), k is the number of conditions (3), and Ri is the sum of the ith column (condition). It’s a gnarly-looking equation, and students gulp when it’s presented.

However, it turns out this equation is only correct if there are no ties in your data. My colleague informed me this was not mentioned in the book he had access to, and I have since checked 3 books that I have at home (haven’t checked the ones in my office, yet), and all of them give the above formula and make no mention of a correction-for-ties.

The interweb didn’t help much, either, in trying to work out what was going wrong, but I stumbled across a book chapter on Google books (link is here) which gives a corrected formula. To help others, it’s repeated below:

correct Friedman

Wikipedia to the rescue (!)

It turns out that Wikipedia has an excellent page for Friedman’s test which provides yet another (set of) formula(s). I’m not quite sure the original source that Wikipedia got this equation from, but it turns out that this is a general formula that works for tied data and non-tied data alike. So, it makes sense that R might be using this formula (it would be more economical than having to continually test whether it should use the first or the second equation reported above).

What have I learned?

  • Students are very observant. Had this student not contacted my colleague (and he not subsequently contact me), this error would have crept into next year, too. Genuine thanks to them for coming forward and questioning my lecture slides. 
  • Statistics is hard. How can we expect students to “get” all of this when I didn’t even know there was a mistake here.
  • Most textbooks discussing this test don’t use tied data, and so use the first equation above. This does not work with tied data. You can use the second equation, but best use the generalised equation on Wikipedia.
  • Wikipedia is (sometimes, at least!) accurate!
  • R is still ace.

For those interested, here is the R code for performing the Friedman equations from Wikipedia. This is just to show what it’s doing; you can just see the R help pages for ?friedman.test for a shortcut to computing the test.

rankData <-
  matrix(c(2.5, 2.5, 1,
           3, 1.5, 1.5,
           1.5, 3, 1.5,
           2, 3, 1,
           1.5, 3, 1.5,
           3, 1.5, 1.5,
           2, 2, 2,
           3, 1.5, 1.5,
           2.5, 1, 2.5,
           3, 1.5, 1.5,
           2, 3, 1,
           2.5, 2.5, 1),
         nrow = 12,
         byrow = TRUE,
         dimnames = list(1:12, c("None", "Classical", "Dance")))

rbar_none <- (1/12) * sum(rankData[, 1])
rbar_classical <- (1/12) * sum(rankData[, 2])
rbar_dance <- (1/12) * sum(rankData[, 3])

rbar <- (1/(3*12)) * sum(rankData)

ssT <- 12* sum(((rbar_none - rbar)^2), ((rbar_classical - rbar) ^ 2), ((rbar_dance - rbar) ^ 2))
ssE <- (1/(12*2)) * sum(((rankData - rbar) ^ 2))

Q <- ssT/ssE