Bayesian Estimation of Partial Correlations

Correlations are a popular analysis tool in psychology to examine the extent to which two variables are related. For example, one might be interested in whether there is a relationship between shoe size and height. But what if you want to explore the relationship between two variables whilst controlling for the effects of a third variable? For example, you might be interested in the relationship between shoe size and height whilst controlling for age. In order to do this, one needs to use partial correlations.

Whilst there are relatively straightforward methods to calculate partial correlations using frequentist statistics, I was interested recently whether there is a Bayesian parameter estimation version of this analysis. A cursory glance of the internet didn’t bring up much1, so I set about developing my own.

Below, I briefly cover the frequentist implementation before providing an overview of the Bayesian parameter estimation version.

Frequentist Implementation

Assume we have three variables, X, Y, and Z. We are interested in the relationship between X and Y, but need to account for the effect of Z. I simulated 75 data points (these could be participants in a study), and calculation of a standard Pearson’s correlation between X and Y provides rx,y(74) = 0.48, p<0.05. A “significant” relationship! However, this doesn’t take into account the effect of Z.

Examination of all possible correlations between X, Y, and Z reveals a strong relationship between Z and the other two variables:

  • rx,z(74) = 0.58, p<0.05.
  • ry,z(74) = 0.62, p<0.05.

Examining the relationship between X and Y whilst controlling for Z (denoted here as *rx,y|z*) is given by the following formula:

r_{x,y|z} = \frac{r_{x,y} - (r_{x,z} \times r_{y,z})}{\sqrt{1 - r_{x,z}^{2}} \times \sqrt{1 - r_{y,z}^{2}}}

When calculating this partial correlation between X and Y, controlling for Z, we get rx,y|z(74) = 0.19, p = .10. Note now that the relationship between X and Y is no longer “significant”. The Figure below shows the relationship with Regular (i.e., Pearson’s) and Partial correlation.

Bayesian Parameter Estimation

This is all well and good, but what if you want a Bayesian estimate of the partial correlation parameter? I had a quick check of the internet, and couldn’t see anything. (I should say that I didn’t spend long looking; I like to do some things from scratch just so I can learn something, and this was one of those times.) Inspired by the SUPERB book by Lee & Wagenmakers on Bayesian modelling, I devised a Bayesian graphical model of partial correlation:


This model extended the one presented in the Lee & Wagenmakers book which was used to calculate the correlation coefficient between just two variables. There are two extensions in the current model over that of Lee & Wagenmakers:

  • The model is extended to accommodate three variables of interest instead of two. As such, three correlation parameters are estimated from the data: rx,y, rx,z, and ry,z.2
  • The model has a new parameter, \theta , which denotes the partial correlation of interest in the current example, that is rx,y|z

The model assumes—as does the one by Lee & Wagenmakers—that the data are modelled as draws from a multivariate normal distribution. The parameters of this distribution are the means of each of the three variables (denoted \mu_{1}, \mu_{2},\mu_{3},  ) and their standard deviations (\sigma_{1}, \sigma_{2},\sigma_{3},  ), as well as the correlation coefficients that link them (r_{1,2}, r_{1,3}, r_{2,3},  ). I use broad, rather uninformative priors (which of course can be tweaked later if one wishes).

The parameter of interest, \theta , inherits the distributions from the model parameters pertaining to the correlation coefficients (r_{1,2}, r_{1,3}, r_{2,3},  ), and from them generates a new distribution of the partial correlation parameter (according to the first equation in this post). The distribution of interest for inference is now this new distribution pertaining to the partial correlation parameter.


Recall that the frequentist estimate of the partial correlation coefficient was rx,y|z(74) = 0.19. Below is a density plot of the posterior distribution of the \theta parameter from the Bayesian graphical model above3.

The mode of this posterior distribution was 0.205, with a 95% Highest-Density Interval spanning from -0.03 to 0.39. Note that whilst the modal estimate of the partial correlation parameter was close to that of the frequentist analysis, the Bayesian parameter estimation provides much more information, in particular regarding the uncertainty of this estimate.


I am not sure if this implementation is correct, so I advise using it with caution. But, it proved an interesting exercise, and I am not aware of a current implementation of this.


### initial set up
rm(list = ls())

# set working directory
setwd("D:/Work/Other/Blog_YouTube code/Blog/Bayesian Partial Correlation")

# load libraries

# set seed for reproducible code

### define functions

# function to simulate partially correlated data
get_data <- function(n){
  x <- rnorm(n, 0, 1)
  y <- .5 * x + rnorm(n, 0, 1)
  z <- .3 * x + .6 * y + rnorm(n, 0, 1)
  data <- data.frame(x = x, y = y, z = z)

### get data & conduct frequentist analysis

# generate the data
n <- 75
x <- get_data(n)

## plot the data with linear model fits
op = par(cex.main = 1.5, mar = c(5,6,4,5) + 0.1, mgp = c(3.5, 1,0), 
         cex.lab = 1.5, font.lab = 2, cex.axis = 1.3, bty = "n", las = 1)

# do pairs plot
pdf("pairs.pdf", width = 6, height = 6)
pairs(x, upper.panel = NULL, pch = 19)

# do correlation plot
pdf("correlation.pdf", width = 6, height = 6)
plot(x$y, x$x, pch = 17, ylab = "", xlab = "")

# model not controlling for z
mod_1 <- lm(x$x ~ x$y)
abline(a = mod_1$coefficients[1], b = mod_1$coefficients[2], lwd = 3, lty = 1, 
       col = "red")

# model controlling for z
mod_2 <- lm(x$x ~ x$y + x$z)
abline(a = mod_2$coefficients[1], b = mod_2$coefficients[2], lwd = 3, lty = 1, 
       col = "blue")
legend("bottomright", c("Regular", "Partial"), lty = c(1, 1), bty = "n", 
       lwd = 3, col = c("red","blue"), cex = 1.5)

# get the frequentist estimate of correlation & partial correlation
# note that the correlation between x and y is no longer significant when
# controlling for z
freq_r <- rcorr(as.matrix(x))
freq_partial_r <- pcor(x)

### Conduct Bayesian parameter estimation

# declare the JAGS model code
model_code <- "
model {

# data
for(i in 1:n){
x[i, 1:3] ~ dmnorm.vcov(mu[], TI[,])

# priors
mu[1] ~ dnorm(0, .001)
mu[2] ~ dnorm(0, .001)
mu[3] ~ dnorm(0, .001)

lambda[1] ~ dgamma(.001, .001)
lambda[2] ~ dgamma(.001, .001)
lambda[3] ~ dgamma(.001, .001)

r_xy ~ dunif(-1, 1)
r_xz ~ dunif(-1, 1)
r_yz ~ dunif(-1, 1)

# reparameterisation
sigma[1] <- 1/sqrt(lambda[1])
sigma[2] <- 1/sqrt(lambda[2])
sigma[3] <- 1/sqrt(lambda[3])

T[1, 1] <- 1 / lambda[1]
T[1, 2] <- r_xy * sigma[1] * sigma[2]
T[1, 3] <- r_xz * sigma[1] * sigma[3]

T[2, 1] <- r_xy * sigma[1] * sigma[2]
T[2, 2] <- 1 / lambda[2]
T[2, 3] <- r_yz * sigma[2] * sigma[3]

T[3, 1] <- r_xz * sigma[1] * sigma[3]
T[3, 2] <- r_yz * sigma[2] * sigma[3]
T[3, 3] <- 1 / lambda[3]

TI[1:3, 1:3] <- T[1:3, 1:3]

# partial correlation calculation
num <- r_xy - (r_xz * r_yz)
denom <- sqrt(1 - pow(r_xz, 2)) * sqrt(1 - pow(r_yz, 2))
partial <- num/denom

# model details
jags_info <- list("x", "n")
parameters <- c("r_xy", "r_xz", "r_yz", "mu", "sigma", "partial")

# fit the model
sample <- jags(jags_info, inits = NULL, parameters, 
               model.file = textConnection(model_code), n.chains = 1, 
               n.iter = 10000, n.burnin = 500, n.thin = 5, DIC = F)

# look at the overview of the parameter estimates

# extract the posterior samples of the partial correlation (4th column) &
# calculate the 95% HDI
posterior <- sample$BUGSoutput$sims.matrix[, 4]
sample_mcmc <- as.mcmc(posterior)
hdi <- HPDinterval(sample_mcmc)

###  plot

# do some preparation for plotting by finding the mode of the posterior
dens <- density(posterior)
posterior_mode <- dens$x[which.max(dens$y)]

# do the plot
pdf("bayesian_estimate.pdf", width = 6, height = 6)
plot(dens, xlim = c(-1, 1), ylim = c(0, max(dens$y) + 0.55), 
     main = "", xlab = "Partial Correlation Estimate", lwd = 2)

# add the mode of the sample & the HDI etc.
lines(x=c(posterior_mode, posterior_mode), y=c(2.5, 3.8), lty = 2, lwd = 2, 
      col = "red")
text(x= posterior_mode, y = max(dens$y) + 0.5, paste("Posterior mode =",
                                         round(posterior_mode, 3), sep = " "), 
     cex = 1.2, col = "red")
lines(x = c(hdi[1],hdi[1]), y = c(0,0.2), lwd = 2, col = "red")
lines(x = c(hdi[2],hdi[2]), y = c(0,0.2), lwd = 2, col = "red")
lines(x = c(hdi[1],hdi[2]), y = c(0.1,0.1), lwd = 2, col = "red")
text(x = (hdi[1] + hdi[2]) / 2, y = 0.325, "95% HDI", cex = 1.2, col = "red")

  1. Note that there is a paper from Wetzel & Wangenmakers (2012) which demonstrates the calculation of Bayes factors for correlation and partial correlation using summary statistics (i.e., Pearson’s r and the degrees of freedom). Note also (as I say in the main post) that I didn’t search that hard for a solution to this problem as I was keen to make my own method. So, there is probably a better way of doing this that is already available. 
  2. Note that in the model of Lee & Wagenmakers with just two variables, one must take the inverse of the variance–covariance matrix when passing it to the dmnorm function in JAGS. With three or more variables, there are issues when taking the inverse of the matrix because it is then not positive definite. (See this link for the help I received from the JAGS owner on this issue) The solution involves using JAGS versions newer than 4.3.0, and using the dmnorm.cov function instead. 
  3. As this is just a toy example, I kept the model-fit time as quick as possible. I generated 10,000 samples from the posterior distributions, treating the first 500 as burn-in samples. The thinning rate was set to 5, and just one chain was used. Note that this is not optimal, but the fit time was quite slow for the model. 

Reproducibility Article in “The Conversation”

I was asked to write 200-300 words on my views on whether there is a reproducibility crisis in the sciences for an article that was appearing in The Conversation. I was so passionate about what I was writing that I ended up writing over 1,200 words. The final article was, of course, edited down by their team to meet the 300 word guide. Below I have posted my full piece.

That there is a reproducibility crisis in psychological science—and arguably across all sciences—is, to me, beyond doubt. Murmurings of low reproducibility began in 2011— the so-called “year of horrors” for psychological science (Wagenmakers, 2012), with the infamous fraud case of Diedrik Stapel being its low-light. But murmurings now have empirical evidence. In 2015, the Open Science Collaboration published the findings of our large-scale effort to closely-replicate 100 studies in psychology (Open Science Collaboration, 2015). And the news was not good: Only 36% of studies were replicated.

Whilst low reproducibility is not unique to psychological science—indeed, cancer biology is currently reviewing its own reproducibility rate, and things are not looking great (see Baker & Dolgin, 2017)—psychology is leading the way in getting its house in order. Several pioneering initiatives have been introduced which, if embraced by the community, will leave psychological science in a strong position moving forward. Here I focus on three I believe are the most important.

Study Pre-Registration & Registered Reports

In a delightfully concerning study, Simonsohn et al. (2013) demonstrated that, in the absence of any true effect, researchers can find statistically significant effects in their studies by engaging in questionable research practices (QRPs), such as selectively reporting outcome measures that produced significant effects and dropping experimental conditions that produced no effect. Another QRP could include analysing your data in a variety of ways (for example, maybe a couple of participants didn’t show the effect you were looking for, so why not remove them from the analysis and see whether that “clears things up”?). What was concerning about this study is that many of these QRPs were not really considered “questionable” at the time. Indeed, many researchers have admitted to engaging in such QRPs (John et al., 2013).

As such, I do not believe that the presence of QRPs reflect explicit attempts at fraud. Rather, they likely stem from a blurred distinction between exploratory and confirmatory research. In exploratory research, many measures might be taken, many experimental conditions administered, and the data scrutinised using a variety of approaches looking for interesting patterns. Confirmatory research tests explicit hypotheses using pre-planned methods and analytical strategies. Both approaches are valid—exploratory research can generate interesting questions, and confirmatory research can address these questions—but what is not valid is to report an exploratory study as though it were confirmatory (Wagenmakers et al., 2012); that is, to find an effect in exploratory research and to publish the finding together with a narrative that this effect was expected all along.

Many researchers have started to pre-register their studies detailing their predictions, experimental protocols, and planned analytical strategy before data collection begins. When the study is submitted for publication, researchers can demonstrate that no QRPs have occurred because they can point to a time-stamped document verifying their plans before data collection commenced, leading to an increase in confidence in the claims reported. This is confirmatory research at its finest.

Some journals have taken this one stage further, by introducing Registered Reports, where papers containing details of a study’s rationale and detailed proposed methods are reviewed and accepted (or rejected!) for publication before the experiment has been conducted. The neuroscience journal Cortex—with their Registered Reports Editor Professor Chris Chambers of Cardiff University—has led the way with this format. Many other journals have now started to offer such reports.

This is an important contribution to the academic publishing structure because it incentivises best research practice. Here research is judged on the soundness of the methods and the importance of the question being addressed, and not the particular results of the study. Current incentive structures in our universities—together with general pressure for increased publications (the so-called “publish or perish” attitude)—leads researchers to prioritise “getting it published” over “getting it right” (Nosek et al., 2012), potentially leading to implicit or explicit use of QRPs to ensure a publishable finding. With the advent of Registered Reports, researchers can finally do both: prioritise “getting it right” by submitting a strong and well-evidenced research proposal, and it will be published regardless of what the data say.

Open Data, Open Materials

Science works by independent verification, not by appeal to authority. As noted by Wicherts and colleagues (2011), independent verification of data analysis is important because “…analyses of research data are quite error prone, accounts of statistical results may be inaccurate, and decisions that researchers make during the analytical phase of a study may lean towards the goal of achieving a preferred (significant) result” (p. 1). Given this importance, most journal policies ask for researchers to make available their data. Yet, when asked for their data, Wicherts and colleagues (2006) found just 73% of researchers provided their data when asked. Some researchers have begun to refuse to review journal article submissions unless the authors provide their data (or provide a convincing reason for why this is not possible) as part of the Peer-Reviewers’ Openness Initiative (see Morey et al., 2015); after all, if a reviewer cannot access the data a paper is based upon, how can a full review be completed?

The flagship psychology journal Psychological Science since 2014 has incentivised researchers to share their experimental material and data by providing badges to studies that comply to open practices by publishing data and materials together with their papers in the journal. (The journal offers a third badge if the study is pre-registered.) This intervention has been remarkably effective: Kidwell et al. (2016) reported that 23% of studies in Psychological Science provided open data, a rise from lower than 3% before the badges were in use. More journals are now encouraging authors to make their data open as a consequence.

Registered Replication Reports

I tell my students all the time that “replication is the most important statistic” (source of quote unknown). To me, an empirical finding in isolation doesn’t mean all that much until it has been replicated. In my own lab, I make an effort to replicate an effect before trying to publish it. As my scientific hero Richard Feynman is famous for saying “Science is a way of trying not to fool yourself… …and you are the easiest person to fool”. As scientists, we have a professional responsibility to ensure the findings we are reporting are robust and reproducible.

But we must also not allow others’ findings to fool us, either. That is why replication of other people’s findings should become a core component of any working lab (a course of action we have facilitated by publishing a “Replication Recipe”: a guide to performing convincing replications; Brandt et al., 2014).

You’d be forgiven for thinking that reports of replications must be common place in the academic literature. This is not the case. Many journals seek novel theories and/or findings, and view replications as treading over old ground. As such, there is little incentive for career-minded academics to conduct replications. However, if the results of the Open Science Collaboration (2015) tell us nothing else, it is that old ground needs to be re-trodden.

The Registered Replication Report format in the high-impact journal Perspectives on Psychological Science seeks to change this. In this format, many teams of researchers each independently perform a close replication of an important finding in the literature, all following an identical and shared protocol of study procedures. The final report—a single paper with all contributing researchers gaining authorship—collates the findings across all teams in a meta-analysis to firmly establish the size and reproducibility of an effect. Such large-scale replication attempts in a high-profile journal such as Perspectives can only help to encourage psychological scientists to view replication as a valid area of  their research programme.


2011 was described as a year of horrors for psychological science. Whilst certainly improvements can be made, our discipline has made impressive strides to improve our science. In just 6 years psychological has moved from a discipline in crisis to a discipline leading the way in how to conduct strong, rigorous, reproducible research.


Baker, M, & Dolgin, E. (2017). Cancer reproducibility project releases first results. Nature, 541, 7637, 269.

Brandt, M.J., IJzerman, H., Dijksterhuis, A., Farach, F., Geller, J., Giner-Sorolla, R., Grange, J.A., Perugini, M., Spies, J., & van ‘t Veer, A. (2014). The replication recipe: What makes for a convincing replication? Journal of Experimental Social Psychology, 50, 214-224.

John L. K., Loewenstein G., Prelec D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23, 524–532

Kidwell, M.C., Lazarević, L.B., Baranski, E., Hardwicke, T.E., Piechowski, S., Falkenberg, L-S., et al. (2016). Badges to acknowledge open practices: A simple, low-cost, effective method for increasing transparency. PLoS Biology, 14(5), e1002456.

Morey, R. D., Chambers, C. D., Etchells, P. J., Harris, C. R., Hoekstra, R., Lakens, D., . . . Zwaan, R. A. (2016). The peer reviewers’ openness initiative: Incentivizing open research practices through peer review. Royal Society Open Science, 3(1), 150547.

Nosek, B.A., Spies, J.R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 6, 615-631.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, 943.

Simons, D.J., Holcolmbe, A.O., & Spellman, B.A. (2014). An Introduction to Registered Replication Reports at Perspectives on Psychological Science. Perspectives on Psychological Science, 9, 552–555

Wagenmakers, E.-J. (2012). A year of horrorsDe Psychonoom, 27, 12-13.

Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H.L.J., & Kievit, R. (2012). An agenda for purely confirmatory research. Persepctives on Psychological Science, 7, 632-638.

Wicherts, J.M., Bakker, M., & Molenaar, D. (2011). Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS ONE 6(11), e26828.

Wicherts, J.M., Borsboom, D., Kats, J., & Molenaar, D. (2006) The poor availability of psychological research data for reanalysis. American Psychologist, 61, 726–728.

Low Power & Effect Sizes

Yesterday I posted the following tweet which has since turned out to be my most popular tweet EVER with hundreds of retweets and “likes” in 24 hours:

My motivation for the tweet was quite straightforward. I have recently been emailing academics in my department every week with different topics in an attempt to raise awareness of topics associated with increasing the information value of the research we are conducting as a department. This week’s topic was “Power”. In my email—in which I included a copy of Button et al.’s (2013) excellent paper on low-power in the neurosciences—I mentioned in passing that power is not just an issue for statistical significance. I have heard from people before that low power is only an issue when interpreting null results, and that if a study produces a significant outcome, then power is not an issue.

To pre-empt this response to my encouragement to increase the power of our studies, I said in my email: “Studies with low power and significant effects have been shown to over-estimate effect sizes, meaning your low-powered study—although significant—is not giving you precision.”

As soon as I sent the email, I realised that I couldn’t recall ever reading a study that had demonstrated this. Now, I knew that such a study (or studies) would have been conducted, but I realised that I had never actually read it myself. It turns out that such studies have indeed been conducted before, as people helpfully pointed out to me on Twitter in response to my tweet:

As I was unaware of these studies—plus it was a Sunday, and I was a little bored—I thought instead of doing a literature search I would code a simulation demonstrating the inflation of effect sizes with low-powered, significant, studies the results of which I emailed to my department to demonstrate that what I had said was indeed the case. Then I thought, “Well, I haven’t tweeted much this year, so why not put it on Twitter, too.”

The incredible engagement I have had with this tweet—I propose—is due to this being a rather under-appreciated fact. Indeed, I “knew” that low-powered studies over-estimate effect sizes, but I didn’t KNOW it in the sense that I had seen hard evidence for it.


Details of Simulation

Because my tweet was made in passing, I didn’t explain in much detail about the stimulation implementation. I discuss this here in case others want to extend the simulation in some way.

The effect size of interest is a measure of correlation between two measures. I arbitrarily chose IQ (mean = 100, SD  = 20) and response time (mean = 600ms, SD = 80ms). I fixed the “true” effect size to be r = 0.3. It turns out that to obtain 80% power for an r=0.3 requires 85 subjects. In my simulation, I wanted to explore a wide range of sample sizes, so chose the set 10, 20, 30, 50, 85, 170, 500, and 1000.

For each sample size—N—I simulated 1,000 “studies”. For each simulated study, the following procedure occurred:

  • Sample N draws from a multivariate normal distribution with the means and SD for IQ and RT as above and a population correlation coefficient of 0.3
  • Conduct a Pearson’s correlation between the two samples
  • If the correlation was significant, store the observed correlation coefficient in a new data frame
  • If the correlation was not significant, move on without storing anything
  • After 1,000 studies are completed, plot a boxplot of the observed effect sizes for N

The result was the image in the tweet.


Many limitations exist to this simulation, and I point interested readers to the material cited above in others’ tweets for a more formal solution. I didn’t intend for this to be a rigorous test, so it shouldn’t be taken too seriously; it was more for my own curiosity and also to provide a graphical image I could send to my colleagues at Keele to show the imprecision of effect sizes with low power. The particular outcomes are likely sensitive to my choice of means, SDs, r, etc. So, don’t generalise the specifics of this simulation, but maybe code your own tailored to your study of interest.

For me this was a bit of fun. Ten minutes of coding was time well spent on Sunday!

The Polls Weren’t Wrong

TL;DR: Trump had a 28% chance to win. We shouldn’t be surprised he won.

I’m not going to comment on the political outcome of last week’s US Presidential elections; enough ink—both pen-ink and eye-ink—has been spilled about that. What I am going to comment on though is the growing feeling the polls were wrong, and why an understanding of probability and statistical evidence might lead us to a more positive conclusion (or at least, less negative).

After last week’s “shock” result—read on for why shock is in scare-quotes—many news articles began to ask “Why were the polls wrong?”. (For a list, see the Google search here). This question is largely driven by the fact influential pollsters heavily favoured a Clinton victory. For example, FiveThirtyEight’s polls predicted a 71.4% chance of a Clinton victory. The New York Times predicted an 85% chance of a Clinton victory. Pretty convincing, huh? Something must have gone wrong in these polls, right?



All polls are known to be sub-optimum, but even if we found a way to conduct a perfect poll, and this perfect poll predicted a 71.4% chance of a Clinton victory, could we now state after observing a Trump victory that the polls were wrong? No, and the reason why most of us find this difficult to grasp is that most of us don’t truly appreciate probability.

No poll that I am aware of predicted a 100% chance of a Clinton victory. All polls that I saw had a non-zero chance of a Trump victory. So, even if with our “perfect” poll we see that Trump had a 28.6% chance of winning the election, we should not be surprised with a Trump victory. You can be disgusted, saddened, and /or scared, but you should not be surprised. After all, something with a 28.6% chance of occurring has—you guessed it!—a 28.6% chance of occurring.

28.6% translates to a 1 in 3.5 chance. If you think of a 6-sided die, each number has a 1 in 6 chance of being rolled on a single roll (~16.67% chance). Before you roll the die, you expect to see any other number than a 6. Are you surprised then if when you roll the die you observe a 6? Probably not. It’s not that remarkable. Yet it is expected less than Trump’s 28.6%. Likewise, if the weather-person on TV tells you there is a 28.6% chance of rain today, are you surprised if you get caught in a shower on your lunch break? Again, probably not.

So, the polls weren’t wrong at all. All predicted a non-zero chance of a Trump victory. What was wrong was the conclusion made from the polls.

Richard Royall & “Statistical Evidence”

The above raced through my mind without a second thought when I read numerous articles claiming the polls were wrong, but it was brought into sharper focus today when I was reading Richard Royall’s (excellent) chapter “The Likelihood Paradigm for Statistical Evidence”. In this chapter, he poses the following problem. A patient is given a non-perfect diagnostic test for a disease; this test has a 0.94 probability of detecting the disease if it is present in the patient (and therefore a 0.06 probability of missing the disease when it is present). However, it also has a non-zero probability of 0.02 of producing a “positive” detection even though the disease is not present (i.e., a false-positive).

The table below outlines these probabilities of the test result for a patient who does have the disease (X = 1) and a patient who does not have the disease (X = 0).


Now a patient comes to the clinic and the test is administered. The doctor observes a positive result. What is the correct conclusion the doctor can make based on this positive test result?

  1. The person probably has the disease.
  2. The person should be treated for the disease.
  3. This test result is evidence that this person has the disease.

The person probably has the disease

Intuitively, I think most people would answer this is correct. After all, the test has a 0.94 probability of detecting the disease if present, and we have a positive test result. It’s unlikely that this is a false positive, because this only occurs with a probability of 0.02.

However, this does not take into account the prior probability of the disease being present. (Yes, I have just gone Bayesian on you.) If the disease is incredibly rare, then it turns out that there is a very small probability the patient has the disease even after observing a positive test outcome. For a nice example of how the prior probability of the disease influences the outcome, see here.

The person should be treated for the disease

It should be clear from the above that this conclusion also depends on the prior probability of the disease. If the disease is incredibly rare, the patient doesn’t likely have it (even after a positive test result), so don’t waste resources (and risk potential harm to the patient). Again, the evidence doesn’t allow us to draw this conclusion.

This test result is evidence that this person has the disease

Royall argues that this is the only conclusion one can draw from the evidence. It is subtly different from Conclusion 1, but follows naturally from the “Law of Likelihood”:

If hypothesis A implies that the probability that a random variable X takes the value x is pA(x), while hypothesis B implies that the probability is pB(x), then the observation X = x is evidence supporting A over B if and only if pA(x) is less than pB(x)…

In our “disease” example, the observation of a positive result is evidence that this person has the disease because this outcome (a positive result) is better predicted under the hypothesis of “disease present” than the hypothesis “disease absent”. But it doesn’t mean that the person probably has the disease, or that we should do anything about it.

Back to Trump

Observing a Trump victory after a predicted win of 28.6% isn’t that surprising. The polls weren’t wrong. 28.6% is a non-zero chance. We should interpret this evidence in a similar way to the disease example: These poll results are evidence that Clinton will win. It is a mistake to interpret them as “Clinton probably will win”.

Replication Crisis: What Changes Have your Department Made?

It’s been a year since the Open Science Collaboration’s publication on “Estimating the Reproducibility of Psychological Science” was published in Science. It has been cited 515 times since publication, and has been met with much discussion on social networks.

I am interested in what changes your psychology department have made since. Are staff actively encouraged to pre-register all studies? Do you ask faculty members for open data? Are faculty members asked to provide open materials? Do ethics panels check the power of planned studies? Have you embedded Open Science practices into your research methods teaching?

I am preparing a report for my department on how we can address the issues surrounding the replication crisis, and I would be very interested to hear what other departments have done to address these important issues. Please comment on this post with what your department has done!

Do Olympic Hosts Have a “Home-Field” Advantage?

My wife questioned in passing yesterday whether summer Olympic hosts have a home-field advantage; that is, do the hosts generally win more medals in their hosting year than in their non-hosting years?

That a home-field advantage exists in many team sports is generally not disputed—see for example this excellent blog post by the Freakonomics team. But is this true for (generally) individual sports like the Olympics? Most of us Brits recall our amazing—and quite unusual—3rd place finish when we hosted the event in 2012, so anecdotally I can understand why suspicion of a home-field advantage exists.

But is it real? I am quite sure there is an answer to this question on the web somewhere, but I wanted to take this opportunity to try and find an answer myself. Basically, I saw this as an excuse to learn some web-scraping techniques using R statistics.

The Data

Wikipedia holds unique pages for each Summer Olympic games. On these pages are medal tables tallying the number of Gold, Silver, and Bronze each competing nation won that year, as well as the Total Medals. So, I wrote some code in R that visits each of these pages in turn, finds the relevant html table containing the medal counts, and extracts it into my work-space. I only looked at post-2nd-world-war games.

My idea was to plot all the medals won for each host nation for all years they have appeared at the games. I was interested in whether the total number of medals that the host won in their host-year was more than their average (mean) across all the games the host had appeared. If there is some sort of home-field advantage, generally we would expect their host-year to be one of their better years, certainly above their average Olympic performance.

The Results

Below is a plot of the results. The header of each plot shows who the host was that year, and the data in each plot shows the total number of medals won by the host in all of the games they have appeared in. To help interpretation of the results, for each plot, the vertical blue line shows the year that nation hosted the games, and the horizontal red line shows that nation’s mean performance across all their games.



I would take this data as providing some evidence that nations generally perform better when they are hosting the games. 11 out of 16 nations had their best year the year they hosted the games. All nations performed above average the year they hosted the games (although maybe Canada, 1976, just missed out).

The Real Conclusion (And the Code)

Coding in R is fun, and I look for any excuse to work on new projects. This is my first attempt at doing web scraping, and it wasn’t as painful as I thought it would be. Below is the code, relying a lot on the rvest R package which I highly recommend; check out this nice introduction to using it.

The code I wrote is below. It’s certainly not optimal, and likely full of errors, but I hope someone finds it of use. Although I tried to automate every aspect of the analysis, some aspects had to be manually altered (for example to match “Soviet Union” data with “Russia” data).


# clear workspace
rm(list = ls())

# set working directory
setwd("D:/Work/Blog_YouTube code/Blog/Olympic Medals")

# load relevant packages

# suppress warnings
options(warn = -1)

### get a list of all of the host nations

# set the url and extract html elements
host_url <- ""
temp <- host_url %>%
  html %>%

# extract the relevant table
hosts <- data.frame(html_table(temp[1]))

# remove the years that the Olympics were not held
hosts <- hosts[!grepl("not held", hosts$Host.City..Country), ]

# remove the cities from the host column
countries <- hosts$Host.City..Country
countries <- gsub(".*,", "", countries)
hosts$Host.City..Country <- countries

# remove the Olympics that are ongoing (or are yet to occur) and generally
# tidy the table up. Also, only select post-1948 games.
hosts <- hosts %>%
  select(-Olympiad) %>%
  select(year = Year, host = Host.City..Country) %>%
  filter(year < 2016 & year > 1948)

# remove white space from the names
hosts$host <- gsub(" ", "", hosts$host, fixed = TRUE)

# change host England to Great Britain. 
# change SouthKorea to South Korea
# change USSR to Russia
hosts$host <- gsub("England", "Great Britain", hosts$host, fixed = TRUE)
hosts$host <- gsub("SouthKorea", "South Korea", hosts$host, fixed = TRUE)
hosts$host <- gsub("USSR", "Russia", hosts$host, fixed = TRUE)

### get the medal tables for each year and store them in one list

# get a vector of all years
years <- hosts$year

# create a list to store the medal tables
medal_tables <- list()

# loop over each year and retrieve the data from Wikipedia
for(i in 1:length(years)){
  # what is the current year?
  curr_year <- years[i]
  # construct the relevant URL to the Wikipedia page
  url <- paste("", curr_year, 
               "_Summer_Olympics_medal_table", sep = "")
  # retrieve the data from this page
  temp <- url %>%
    html %>%
  # find the html table's position. The medal table is in a "sortable" Wiki 
  # table, so we search for this term and return its position in the list
  position <- grep("sortable", temp)
  # get the medal table. Add a new column storing the year
  medals <- data.frame(html_table(temp[position], fill = TRUE))
  medals <- medals %>%
    mutate(Year = curr_year)
  # change the names of the "Nation" column, as this is not consistent between
  # games tables
  colnames(medals)[2] <- "Nation"
  # remove the weird symbols from the html file (Â)
  nations <- medals$Nation
  nations <- gsub("[^\\x{00}-\\x{7f}]", "", nations, perl = TRUE)
  # we need to change "Soviet Union" to USSR for consistency
  nations <- gsub("Soviet Union(URS)", "Russia(RUS)", nations, fixed = TRUE)
  # also change West & East Germany to "Germany"
  nations <- gsub("East Germany(GDR)", "Germany(GER)", nations, fixed = TRUE)
  nations <- gsub("West Germany(FRG)", "Germany(GER)", nations, fixed = TRUE)
  medals$Nation <- nations

  # save the medal table and move to the next games
  medal_tables[[i]] <- medals


### loop over each host, then find how many medals they won in each games and
### store it in data frame

# initialise the data frame
final_data <- data.frame(hosts)
final_data[, as.character(years)] <- 0

for(i in 1:length(hosts$host)){
  # get the current host
  curr_host <- hosts$host[i]

  # loop over all years, find the number of medals won by the current host, 
  # and store it in final_data frame
  for(j in 1:length(years)){
    # what is the current year?
    curr_year <- years[j]
    # get the medal table for the current year
    curr_medals <- medal_tables[[j]]
    # get the row for the current host if it is present
    curr_medals <- curr_medals %>%
      filter(str_detect(Nation, curr_host))
    # collate the number of medals won if there is data
    if(nrow(curr_medals) > 0){
      final_data[i, j + 2] <- sum(curr_medals$Total)
    } else
      final_data[i, j + 2] <- 0
  } # end of each year loop
} # end of each host loop

### now do some plotting
pdf("medals.pdf", width = 12, height = 12)

# change the layout of the plotting window
par(mfrow = c(4, 4))

# loop over each hosting nation
for(i in 1:nrow(final_data)){
  # get the current host's data for all years
  host_data <- as.numeric(final_data[i, 3:ncol(final_data)])
  # what is their mean number of medals won?
  host_mean <- mean(host_data)
  # plot the data!
  plot(years, host_data, xlab = "Year", ylab = "Number of Medals", pch = 19, 
       type = "b", lwd = 2, 
       main = paste(hosts$host[i], "–", years[i], sep = ""))
  abline(v = final_data$year[i], lty = "dashed", col = "blue", lwd = 1.5)
  abline(h = host_mean, lty = "dashed", col = "red", lwd = 1.5)


Solution to #BarBarPlots in R

I came across an interesting project the other day which is calling for a reconsideration of the use of bar plots (#barbarplots), with the lovely tag-line “Friends don’t let friends make bar plots!”. The project elegantly outlines convincing reasons why bar plots can be misleading, and have successfully funded a campaign to “…increase awareness of the limitations that bar plots have and the need for clear and complete data visualization”.

In this post, I want to show the limitations of bar plots that these scientists have highlighted. Then, I provide a solution to these limitations for researchers who want to continue using bar plots that can easily be cobbled together using R-statistics (with the ggplot2 package).

The Data

Say you are a researcher who collects some data (it doesn’t matter on what) from two independent groups and you are interested in whether there is a difference between them. Most researchers would maybe calculate the mean and standard error of each group to describe the data. Then the researcher might plot the data using a bar plot, together with error bars representing the standard error. To provide an inferential test on whether a difference exists, the researcher would usually conduct an independent samples t-test.

Let’s provide some example data for two conditions:

  • condition A (n = 100): mean of 200.17, a median of 196.43, and a standard error of 6.12
  • condition B (n = 100): mean of 200.11, a median of 197.87, and a standard error of 7.19

Here is the bar plot:


Pretty similar, right? The researcher sees that there is little evidence for a difference; to test this inferentially they conduct an independent samples t-test, with the outcome t(198) = 0.007, p = .995, Cohen’s d < 0.001. The researcher concludes there is no difference between the two groups.

The Problem

The problem raised by the #barbarplot campaign is that bar plots are a poor summary of the distribution of data. The bar plot above suggests there is no difference between the two groups, but the two groups are different! How do I know they are different? I simulated the data. What the bar plot hides is the shape of the underlying distribution of each data set. Below I present a density plot (basically a smoothed histogram) of the same data as above:


Now we can see that the two groups are clearly different! Condition A is a normal distribution, but condition B is bi-modal. The bar plot doesn’t capture this difference.

The Solution

Density plots are a nice solution to presenting the distribution of data, but can get really messy when there are multiple conditions (imagine the above density plot but with 4 or more overlapping conditions). Plus, researchers are used to looking at bar plots, so there is something to be said about continuing their use (especially for factorial designs). But how do we get around the problem highlighted by the #barbarplot campaign?

One solution is to plot the bar plots as usual, but to overlay the bar plot with individual data points. Doing this allows the reader to see the estimates of central tendency (i.e., to interpret the bar plot as usual), whilst at the same time allowing the reader to see the spread of data in each condition. This sounds tricky to do (and it probably is if you are still using Excel; yes, I’m talking to you!), but it’s simple if you’re using R.

Below is the above data plotted as a combined bar and point plot. As you can see, the difference in distribution is now immediately apparent, whilst retaining the advantages of a familiar bar plot. Everyone wins!



R Code

Below is the R code for the combined plot. This includes some code that generates the artificial data used in this example.

# load required packages

#--- Generate artificial data

# set random seed so example is reproducible

# generate condition A
condition <- rep("condition_A", 100)
dv_A <- rnorm(100, 200, 60)
condition_A <- data.frame(condition, dv = dv_A)

# generate condition B
condition <- rep("condition_B", 100)
dv_B <- c(rnorm(50, 130, 10), rnorm(50, 270, 10))
condition_B <- data.frame(condition, dv = dv_B)

# put all in one data frame
raw_data <- rbind(condition_A, condition_B)

# calculate sumary statistics
data_summary <- raw_data %>%
  group_by(condition) %>%
  summarise(mean = mean(dv), 
            median = median(dv),
            se = (sd(dv)) / sqrt(length(dv)))

#--- Do the "combined" bar plot
p2 <- ggplot()

# first draw the bar plot
p2 <- p2 + geom_bar(data = data_summary,
                    aes(y = mean,x = condition,
                        ymin = mean - se,
                        ymax = mean + se), fill = "darkgrey",
                    stat="identity", width=0.4)

# draw the error bars on the plot
p2 <- p2 + geom_errorbar(data = data_summary,
                         aes(y = mean, x = condition,
                             ymin = mean - se,
                             ymax = mean + se), stat = "identity",
                         width = 0.1, size = 1)

# now draw the points on the plot
p2 <- p2 + geom_point(data = raw_data, aes(y = dv, x = condition),
                      size = 3, alpha = 0.3,
                      position = position_jitter(width = 0.3, height = 0.1))

# scale and rename the axes, and make font size a bit bigger
p2 <- p2 + coord_cartesian(ylim = c(50, 400))
p2 <- p2 + scale_x_discrete(name = "Condition") +
  scale_y_continuous(name = "DV")

p2 <- p2 + theme(axis.text = element_text(size = 12),
                 axis.title = element_text(size = 14,face = "bold"))

# view the plot


10 Recommendations from the Reproducibility Crisis in Psychological Science

This week I gave an internal seminar at my institution (Keele University, UK) entitled “Ten Recommendations from the Reproducibility Crisis in Psychological Science”. The audience was to be faculty members and psychology graduate students. My aim was to collate some of the “best-practices” that have emerged over the past few years and provide direct advice for how researchers and institutions can adapt their research practice. It was hard to come up with just 10 recommendations, but I finally decided on the following:

  1. Replicate, replicate, replicate
  2. Statistics (i): Beware p-hacking
  3. Statistics (ii): Know your p-values
  4. Statistics (iii): Boost your power
  5. Open data, open materials, open analysis
  6. Conduct pre-registered confirmatory studies
  7. Incorporate open science practices in teaching
  8. Insist on open science practices as reviewers
  9. Reward open science practices (Institutions)
  10. Incorporate open science into hiring decisions (Institutions)

The link to the slides are below. I might expand upon this in a fuller blog post in time, if there is interest.



“Bayesian in 8 Easy Steps” Journal Club

I’ve been trying to organise an online journal club to discuss the papers suggested in Alexander Etz and colleagues’ paper “How to become a Bayesian in 8 easy steps”. Several people have filled out the Doodle poll expressing an interest, but unfortunately not everyone can make the same time. As such, I am afraid I will have to go with the time which the majority of people can make. I am sorry that this will leave some people out.

The most popular day & time was Thursdays at 1pm UTC. Therefore, I propose the first meeting be on Thursday 10th March at 1pm. It will be on Google Hangouts, but I need to spend some time working out how to use this before I pass on details of the meet.

Please complete the following Doodle poll to indicate whether you can make this meeting or not. Please also provide your email address together with your name in the poll. Then, I can move the discussion to email rather than having to post on my blog for updates.

See you there!