
Bayesian Estimation of Partial Correlations

Correlations are a popular analysis tool in psychology to examine the extent to which two variables are related. For example, one might be interested in whether there is a relationship between shoe size and height. But what if you want to explore the relationship between two variables whilst controlling for the effects of a third variable? For example, you might be interested in the relationship between shoe size and height whilst controlling for age. In order to do this, one needs to use partial correlations.

Whilst there are relatively straightforward methods to calculate partial correlations using frequentist statistics, I was recently interested in whether there is a Bayesian parameter estimation version of this analysis. A cursory glance at the internet didn't bring up much[1], so I set about developing my own.

Below, I briefly cover the frequentist implementation before providing an overview of the Bayesian parameter estimation version.

Frequentist Implementation

Assume we have three variables, X, Y, and Z. We are interested in the relationship between X and Y, but need to account for the effect of Z. I simulated 75 data points (these could be participants in a study); a standard Pearson's correlation between X and Y gives r_{x,y}(74) = 0.48, p < .05. A "significant" relationship! However, this doesn't take into account the effect of Z.

Examination of all possible correlations between X, Y, and Z reveals a strong relationship between Z and the other two variables:

  • r_{x,z}(74) = 0.58, p < .05.
  • r_{y,z}(74) = 0.62, p < .05.

The relationship between X and Y whilst controlling for Z (denoted here as r_{x,y|z}) is given by the following formula:

r_{x,y|z} = \frac{r_{x,y} - (r_{x,z} \times r_{y,z})}{\sqrt{1 - r_{x,z}^{2}} \times \sqrt{1 - r_{y,z}^{2}}}

When calculating this partial correlation between X and Y, controlling for Z, we get r_{x,y|z}(74) = 0.19, p = .10. Note that the relationship between X and Y is no longer "significant". The figure below shows the relationship between X and Y with both the Regular (i.e., Pearson's) and Partial correlation fits.
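To make the formula concrete, below is a minimal R sketch that computes the partial correlation by hand from the three pairwise correlations and compares it with ppcor's pcor(). It uses freshly simulated data rather than the exact values reported above.

# hand-calculated partial correlation vs. ppcor, on simulated data
library(ppcor)

set.seed(1)
n <- 75
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
z <- 0.3 * x + 0.6 * y + rnorm(n)

# pairwise Pearson correlations
r_xy <- cor(x, y)
r_xz <- cor(x, z)
r_yz <- cor(y, z)

# partial correlation of x and y, controlling for z (formula above)
r_xy_given_z <- (r_xy - r_xz * r_yz) / (sqrt(1 - r_xz^2) * sqrt(1 - r_yz^2))

# ppcor's estimate for the x-y pair (row 1, column 2) should match
pcor(data.frame(x, y, z))$estimate[1, 2]
r_xy_given_z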

Bayesian Parameter Estimation

This is all well and good, but what if you want a Bayesian estimate of the partial correlation parameter? I had a quick check of the internet, and couldn’t see anything. (I should say that I didn’t spend long looking; I like to do some things from scratch just so I can learn something, and this was one of those times.) Inspired by the SUPERB book by Lee & Wagenmakers on Bayesian modelling, I devised a Bayesian graphical model of partial correlation:

[Figure: graphical model for Bayesian estimation of the partial correlation]

This model extends the one presented in the Lee & Wagenmakers book, which was used to estimate the correlation coefficient between just two variables. There are two extensions in the current model over that of Lee & Wagenmakers:

  • The model is extended to accommodate three variables of interest instead of two. As such, three correlation parameters are estimated from the data: r_{x,y}, r_{x,z}, and r_{y,z}.[2]
  • The model has a new parameter, \theta, which denotes the partial correlation of interest; in the current example, that is r_{x,y|z}.

The model assumes—as does the one by Lee & Wagenmakers—that the data are modelled as draws from a multivariate normal distribution. The parameters of this distribution are the means of each of the three variables (denoted \mu_{1}, \mu_{2}, \mu_{3}) and their standard deviations (\sigma_{1}, \sigma_{2}, \sigma_{3}), as well as the correlation coefficients that link them (r_{1,2}, r_{1,3}, r_{2,3}). I use broad, rather uninformative priors (which can of course be tweaked later if one wishes).

The parameter of interest, \theta, inherits the distributions of the model parameters pertaining to the correlation coefficients (r_{1,2}, r_{1,3}, r_{2,3}), and from them generates a new distribution for the partial correlation parameter (according to the first equation in this post). The distribution of interest for inference is now this new distribution over the partial correlation parameter.

Results

Recall that the frequentist estimate of the partial correlation coefficient was r_{x,y|z}(74) = 0.19. Below is a density plot of the posterior distribution of the \theta parameter from the Bayesian graphical model above.[3]

The mode of this posterior distribution was 0.205, with a 95% Highest-Density Interval spanning from -0.03 to 0.39. Note that whilst the modal estimate of the partial correlation parameter was close to that of the frequentist analysis, the Bayesian parameter estimation provides much more information, in particular regarding the uncertainty of this estimate.

Conclusion

I am not sure if this implementation is correct, so I advise using it with caution. But, it proved an interesting exercise, and I am not aware of a current implementation of this.

Code

#------------------------------------------------------------------------------
### initial set up
rm(list = ls())

# set working directory
setwd("D:/Work/Other/Blog_YouTube code/Blog/Bayesian Partial Correlation")

# load libraries
library(ppcor)   # pcor() for the frequentist partial correlation
library(Hmisc)   # rcorr() for the correlation matrix
library(R2jags)  # jags() wrapper around JAGS
library(coda)    # as.mcmc() & HPDinterval() for the 95% HDI

# set seed for reproducible code
set.seed(42)
#------------------------------------------------------------------------------


#------------------------------------------------------------------------------
### define functions

# function to simulate partially correlated data
get_data <- function(n){
  x <- rnorm(n, 0, 1)
  y <- .5 * x + rnorm(n, 0, 1)
  z <- .3 * x + .6 * y + rnorm(n, 0, 1)
  
  data <- data.frame(x = x, y = y, z = z)
  
  return(data)
}
#------------------------------------------------------------------------------


#------------------------------------------------------------------------------
### get data & conduct frequentist analysis

# generate the data
n <- 75
x <- get_data(n)

## plot the data with linear model fits
op = par(cex.main = 1.5, mar = c(5,6,4,5) + 0.1, mgp = c(3.5, 1,0), 
         cex.lab = 1.5, font.lab = 2, cex.axis = 1.3, bty = "n", las = 1)


# do pairs plot
pdf("pairs.pdf", width = 6, height = 6)
pairs(x, upper.panel = NULL, pch = 19)
dev.off()

# do correlation plot
pdf("correlation.pdf", width = 6, height = 6)
plot(x$y, x$x, pch = 17, ylab = "", xlab = "")

# model not controlling for z
mod_1 <- lm(x$x ~ x$y)
abline(a = mod_1$coefficients[1], b = mod_1$coefficients[2], lwd = 3, lty = 1, 
       col = "red")

# model controlling for z
mod_2 <- lm(x$x ~ x$y + x$z)
abline(a = mod_2$coefficients[1], b = mod_2$coefficients[2], lwd = 3, lty = 1, 
       col = "blue")
legend("bottomright", c("Regular", "Partial"), lty = c(1, 1), bty = "n", 
       lwd = 3, col = c("red","blue"), cex = 1.5)
dev.off()

# get the frequentist estimate of correlation & partial correlation
# note that the correlation between x and y is no longer significant when
# controlling for z
freq_r <- rcorr(as.matrix(x))
freq_partial_r <- pcor(x)
#------------------------------------------------------------------------------


#------------------------------------------------------------------------------
### Conduct Bayesian parameter estimation

# declare the JAGS model code
model_code <- "
model {

# data
for(i in 1:n){
x[i, 1:3] ~ dmnorm.vcov(mu[], TI[,])
}

# priors
mu[1] ~ dnorm(0, .001)
mu[2] ~ dnorm(0, .001)
mu[3] ~ dnorm(0, .001)

lambda[1] ~ dgamma(.001, .001)
lambda[2] ~ dgamma(.001, .001)
lambda[3] ~ dgamma(.001, .001)

r_xy ~ dunif(-1, 1)
r_xz ~ dunif(-1, 1)
r_yz ~ dunif(-1, 1)

# reparameterisation
sigma[1] <- 1/sqrt(lambda[1])
sigma[2] <- 1/sqrt(lambda[2])
sigma[3] <- 1/sqrt(lambda[3])

T[1, 1] <- 1 / lambda[1]
T[1, 2] <- r_xy * sigma[1] * sigma[2]
T[1, 3] <- r_xz * sigma[1] * sigma[3]

T[2, 1] <- r_xy * sigma[1] * sigma[2]
T[2, 2] <- 1 / lambda[2]
T[2, 3] <- r_yz * sigma[2] * sigma[3]

T[3, 1] <- r_xz * sigma[1] * sigma[3]
T[3, 2] <- r_yz * sigma[2] * sigma[3]
T[3, 3] <- 1 / lambda[3]

TI[1:3, 1:3] <- T[1:3, 1:3]

# partial correlation calculation
num <- r_xy - (r_xz * r_yz)
denom <- sqrt(1 - pow(r_xz, 2)) * sqrt(1 - pow(r_yz, 2))
partial <- num/denom
}
"

# model details
jags_info <- list("x", "n")
parameters <- c("r_xy", "r_xz", "r_yz", "mu", "sigma", "partial")

# fit the model
sample <- jags(jags_info, inits = NULL, parameters, 
               model.file = textConnection(model_code), n.chains = 1, 
               n.iter = 10000, n.burnin = 500, n.thin = 5, DIC = F)

# look at the overview of the parameter estimates
sample


# extract the posterior samples of the partial correlation &
# calculate the 95% HDI
posterior <- sample$BUGSoutput$sims.matrix[, "partial"]
sample_mcmc <- as.mcmc(posterior)
hdi <- HPDinterval(sample_mcmc)



###  plot

# do some preparation for plotting by finding the mode of the posterior
dens <- density(posterior)
posterior_mode <- dens$x[which.max(dens$y)]

# do the plot
pdf("bayesian_estimate.pdf", width = 6, height = 6)
plot(dens, xlim = c(-1, 1), ylim = c(0, max(dens$y) + 0.55), 
     main = "", xlab = "Partial Correlation Estimate", lwd = 2)

# add the mode of the sample & the HDI etc.
lines(x=c(posterior_mode, posterior_mode), y=c(2.5, 3.8), lty = 2, lwd = 2, 
      col = "red")
text(x= posterior_mode, y = max(dens$y) + 0.5, paste("Posterior mode =",
                                         round(posterior_mode, 3), sep = " "), 
     cex = 1.2, col = "red")
lines(x = c(hdi[1],hdi[1]), y = c(0,0.2), lwd = 2, col = "red")
lines(x = c(hdi[2],hdi[2]), y = c(0,0.2), lwd = 2, col = "red")
lines(x = c(hdi[1],hdi[2]), y = c(0.1,0.1), lwd = 2, col = "red")
text(x = (hdi[1] + hdi[2]) / 2, y = 0.325, "95% HDI", cex = 1.2, col = "red")
dev.off()
#------------------------------------------------------------------------------

  1. Note that there is a paper by Wetzels & Wagenmakers (2012) which demonstrates the calculation of Bayes factors for correlations and partial correlations using summary statistics (i.e., Pearson's r and the degrees of freedom). Note also (as I say in the main post) that I didn't search that hard for a solution to this problem, as I was keen to make my own method. So, there is probably a better way of doing this that is already available. 
  2. Note that in the model of Lee & Wagenmakers with just two variables, one must pass the inverse of the variance–covariance matrix (i.e., the precision matrix) to the dmnorm distribution in JAGS. With three or more variables, inverting the matrix causes problems because it is not always positive definite. (See this link for the help I received from the JAGS maintainer on this issue.) The solution is to use a sufficiently recent version of JAGS and the dmnorm.vcov distribution instead, which is parameterised directly in terms of the variance–covariance matrix. 
  3. As this is just a toy example, I kept the model-fitting time as short as possible. I generated 10,000 samples from the posterior distributions, treating the first 500 as burn-in. The thinning rate was set to 5, and just one chain was used. This is not optimal, but the model was quite slow to fit. 

The Polls Weren’t Wrong

TL;DR: Trump had a 28% chance to win. We shouldn’t be surprised he won.

I’m not going to comment on the political outcome of last week’s US Presidential elections; enough ink—both pen-ink and eye-ink—has been spilled about that. What I am going to comment on though is the growing feeling the polls were wrong, and why an understanding of probability and statistical evidence might lead us to a more positive conclusion (or at least, less negative).

After last week's "shock" result—read on for why shock is in scare-quotes—many news articles began to ask "Why were the polls wrong?". (For a list, see the Google search here). This question is largely driven by the fact that influential pollsters heavily favoured a Clinton victory. For example, FiveThirtyEight's forecast predicted a 71.4% chance of a Clinton victory; The New York Times predicted an 85% chance. Pretty convincing, huh? Something must have gone wrong in these polls, right?

[Figure: FiveThirtyEight's forecast, giving Clinton a 71.4% chance of winning]

All polls are known to be imperfect, but even if we found a way to conduct a perfect poll, and this perfect poll predicted a 71.4% chance of a Clinton victory, could we state after observing a Trump victory that the poll was wrong? No. The reason most of us find this difficult to grasp is that most of us don't truly appreciate probability.

No poll that I am aware of predicted a 100% chance of a Clinton victory. All polls that I saw gave a non-zero chance of a Trump victory. So, even if with our "perfect" poll we see that Trump had a 28.6% chance of winning the election, we should not be surprised by a Trump victory. You can be disgusted, saddened, and/or scared, but you should not be surprised. After all, something with a 28.6% chance of occurring has—you guessed it!—a 28.6% chance of occurring.

28.6% translates to roughly a 1 in 3.5 chance. Think of a six-sided die: each number has a 1 in 6 chance (about 16.7%) of coming up on a single roll. Before you roll, you expect to see a number other than 6; are you surprised if you then roll a 6? Probably not. It's not that remarkable, yet it is less likely than Trump's 28.6%. Likewise, if the weather forecast gives a 28.6% chance of rain today, are you surprised if you get caught in a shower on your lunch break? Again, probably not.
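If the intuition still feels slippery, a two-line simulation makes the point: an event with a 28.6% chance happens all the time. A quick sketch; the 0.286 is the only number it needs.

# how often does a 28.6% event occur? Simulate 100,000 "elections"
set.seed(123)
wins <- rbinom(100000, size = 1, prob = 0.286)
mean(wins)  # roughly 0.286, i.e., about 2 of every 7 simulated elections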

So, the polls weren’t wrong at all. All predicted a non-zero chance of a Trump victory. What was wrong was the conclusion made from the polls.

Richard Royall & “Statistical Evidence”

The above raced through my mind without a second thought when I read numerous articles claiming the polls were wrong, but it was brought into sharper focus today when I was reading Richard Royall's (excellent) chapter "The Likelihood Paradigm for Statistical Evidence". In this chapter, he poses the following problem. A patient is given an imperfect diagnostic test for a disease; the test has a 0.94 probability of detecting the disease if it is present in the patient (and therefore a 0.06 probability of missing the disease when it is present). However, it also has a 0.02 probability of producing a "positive" result even though the disease is not present (i.e., a false positive).

The table below outlines these probabilities of the test result for a patient who does have the disease (X = 1) and a patient who does not have the disease (X = 0).

  Test result   Disease present (X = 1)   Disease absent (X = 0)
  Positive      0.94                      0.02
  Negative      0.06                      0.98

Now a patient comes to the clinic and the test is administered. The doctor observes a positive result. What is the correct conclusion the doctor can make based on this positive test result?

  1. The person probably has the disease.
  2. The person should be treated for the disease.
  3. This test result is evidence that this person has the disease.

The person probably has the disease

Intuitively, I think most people would answer this is correct. After all, the test has a 0.94 probability of detecting the disease if present, and we have a positive test result. It’s unlikely that this is a false positive, because this only occurs with a probability of 0.02.

However, this does not take into account the prior probability of the disease being present. (Yes, I have just gone Bayesian on you.) If the disease is incredibly rare, then it turns out that there is a very small probability the patient has the disease even after observing a positive test outcome. For a nice example of how the prior probability of the disease influences the outcome, see here.
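To see how strongly the prior probability matters, here is a quick worked example in R. The sensitivity (0.94) and false-positive rate (0.02) come from the example above; the prevalence of 1 in 1,000 is my own illustrative assumption, not part of Royall's example as described here.

# Bayes' theorem for the diagnostic test
p_pos_given_disease    <- 0.94    # sensitivity (from the example)
p_pos_given_no_disease <- 0.02    # false-positive rate (from the example)
prevalence             <- 0.001   # assumed prior probability of the disease

p_disease_given_pos <- (p_pos_given_disease * prevalence) /
  (p_pos_given_disease * prevalence +
     p_pos_given_no_disease * (1 - prevalence))

p_disease_given_pos  # ~0.045: even after a positive test, the disease is unlikely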

The person should be treated for the disease

It should be clear from the above that this conclusion also depends on the prior probability of the disease. If the disease is incredibly rare, the patient probably doesn't have it (even after a positive test result), so don't waste resources (and risk potential harm to the patient). Again, the evidence doesn't allow us to draw this conclusion.

This test result is evidence that this person has the disease

Royall argues that this is the only conclusion one can draw from the evidence. It is subtly different from Conclusion 1, but follows naturally from the “Law of Likelihood”:

If hypothesis A implies that the probability that a random variable X takes the value x is pA(x), while hypothesis B implies that the probability is pB(x), then the observation X = x is evidence supporting A over B if and only if pA(x) is greater than pB(x)…

In our "disease" example, the observation of a positive result is evidence that this person has the disease because this outcome (a positive result) is better predicted under the hypothesis "disease present" than under the hypothesis "disease absent"; the likelihood ratio is 0.94/0.02 = 47 in favour of "disease present". But it doesn't mean that the person probably has the disease, or that we should do anything about it.

Back to Trump

Observing a Trump victory after a predicted win probability of 28.6% isn't that surprising. The polls weren't wrong: 28.6% is a non-zero chance. We should interpret this evidence in a similar way to the disease example: these poll results are evidence that Clinton will win. It is a mistake to interpret them as "Clinton probably will win".

“Bayesian in 8 Easy Steps” Journal Club

I’ve been trying to organise an online journal club to discuss the papers suggested in Alexander Etz and colleagues’ paper “How to become a Bayesian in 8 easy steps”. Several people have filled out the Doodle poll expressing an interest, but unfortunately not everyone can make the same time. As such, I am afraid I will have to go with the time which the majority of people can make. I am sorry that this will leave some people out.

The most popular day & time was Thursdays at 1pm UTC. Therefore, I propose the first meeting be on Thursday 10th March at 1pm. It will be on Google Hangouts, but I need to spend some time working out how to use this before I pass on details of the meet.

Please complete the following Doodle poll to indicate whether you can make this meeting or not. Please also provide your email address together with your name in the poll. Then, I can move the discussion to email rather than having to post on my blog for updates.

http://doodle.com/poll/7im5vnk9cddc3vyb

See you there!

(Pesky?) Priors

When I tell people I am learning Bayesian statistics, I tend to get one of two responses: either people look at me blankly—"What's Bayesian statistics?"—or I get scorned for using such "loose" methods—"Bayesian analysis is too subjective!"[1] This latter "concern" arises due to (what I believe to be a misunderstanding of) the prior: Bayesian analysis requires one to state one's prior belief about a certain effect, and then combine this with the data observed (i.e., the likelihood) to update one's belief (the posterior).

On the face of it, it might seem odd for a scientific method to include "subjectivity" in its analysis. I certainly had this doubt when I first started learning it. (And, in order to be honest with myself, I still struggle with it sometimes.) But, the more I read, the more I think this concern is not warranted, as the prior is not really "subjectivity" in the strictest sense of the word at all: it is based on our current understanding of the effect we are interested in, which in turn is (often) based on data we have seen before. Yes, sometimes the prior can be a guess if we[2] have no other information to go on, but we would express the uncertainty of a belief in the prior itself.

The more I understand Bayesian statistics, the more I appreciate that the prior is essential. One understated side-effect of having priors is that they can protect you from dubious findings. For example, I have a very strong prior against UFO claims; therefore, you are going to have to present me with a lot more evidence than some shaky video footage to convince me otherwise. You would not have to provide me with much evidence, however, if you claimed to have had roast beef last night. Extraordinary claims require extraordinary evidence.

But, during my more sceptical hours, I often succumbed to the the-prior-is-nothing-but-subjectivity-poisoning-your-analysis story. However, I now believe that even if one is sceptical of the use of a prior, there are a few things to note:

  • If you are concerned your prior is wrong and is influencing your inferences, just collect more data: A poorly-specified prior will be washed away with sufficient data.

  • The prior isn’t (really) subjective because it would have to be justified to a sceptical audience. This requires (I suggest) plotting what the prior looks like so readers can familiarise themselves with your prior. Is it really subjective if I show you what my prior looks like and I can justify it?

  • Related to the above, the effect of the prior can be investigated using robustness checks, where one plots the posterior distribution based on a range of (plausible) prior values. If your conclusions don’t depend upon the exact prior used, what’s the problem?

  • Priors are not fixed. Once you have collected some data and have a posterior belief, if you wish to examine the effect further you can (and should) use the posterior from the previous study as your prior for the next study.

These are the points I mention to anti-Bayesians I encounter. In this blog post I just wanted to run through some of these with examples. This is selfish; it's not really for your education (there really are better educators out there: my recommendation is Alex Etz's excellent "Understanding Bayes" series, from which this blog post takes much inspiration!). I just want somewhere with all of this written down so that next time someone criticises my interest in Bayesian analysis I can just reply: "Read my blog!". (Please do inform me of any errors/misconceptions by leaving a comment!)

As some readers might not be massively familiar with these issues, I try to highlight some of the characteristics of the prior below. In all of these examples, I will use the standard Bayesian “introductory tool” of assessing the degree of bias in a coin by observing a series of flips.

A Fair Coin

If a coin is unbiased, it should produce roughly equal heads and tails. However, often we don’t know whether a coin is biased or not. We wish to estimate the bias in the coin (denoted theta) by collecting some data (i.e., by flipping the coin); a fair coin has a theta = 0.5. Based on this data, we can calculate the likelihood of various theta values. Below is the likelihood function for a fair coin.

[Figure: likelihood function for theta after observing 50 heads in 100 flips]

In this example, we flipped the coin 100 times, and observed 50 heads and 50 tails. Note how the peak of the likelihood is centered on theta = 0.5. A biased coin would have a true theta not equal to 0.5; theta closer to zero would reflect a bias towards tails, and a theta closer to 1 would reflect a bias towards heads. The animation below demonstrates how the likelihood changes as the number of observed heads (out of 100 flips) increases:

[Animation: the likelihood function as the number of observed heads (out of 100 flips) increases]

So, the likelihood contains the information provided by our sample about the true value for theta.
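A minimal sketch of that likelihood function, assuming the 50-heads-in-100-flips example above:

# binomial likelihood of theta after observing 50 heads in 100 flips
theta      <- seq(0, 1, length.out = 200)
likelihood <- dbinom(50, size = 100, prob = theta)

plot(theta, likelihood, type = "l", lwd = 2,
     xlab = "theta (bias of the coin)", ylab = "Likelihood")
abline(v = 0.5, lty = 2)  # the likelihood peaks at the observed proportion of heads

Changing the 50 to, say, 70 shifts the peak to theta = 0.7, which is all the animation above is doing.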

The Prior

Before collecting data, Bayesian analysts would specify what their prior belief about theta is. Below I present various priors a Bayesian may hold, using the beta distribution (which has two parameters, a and b):

[Figure: four example beta priors]

The upper left plot reflects a prior belief that the coin is fair (i.e., the peak of the distribution is centered over theta = 0.5); however, there is some uncertainty in this prior as the distribution has some spread. The upper right plot reflects total uncertainty in a prior belief: that is, the prior holds that any value of theta is likely. The lower two plots reflect prior beliefs that the coin is biased. Maybe the researcher had obtained the coin from a known con-artist. The lower left plot reflects a prior for a biased coin, but uncertainty about which side the coin is biased towards (that is, it could be biased heads or tails); the lower right plot reflects a prior that the coin is biased towards heads.
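Here is a rough sketch of four such priors. The exact a and b values are my own guesses for illustration; they are not necessarily the values used to draw the figure.

# four illustrative beta priors, loosely matching the four panels described above
theta <- seq(0, 1, length.out = 200)

par(mfrow = c(2, 2))
plot(theta, dbeta(theta, 20, 20),   type = "l", main = "Fair, some uncertainty")
plot(theta, dbeta(theta, 1, 1),     type = "l", main = "Total uncertainty")
plot(theta, dbeta(theta, 0.5, 0.5), type = "l", main = "Biased, direction unknown")
plot(theta, dbeta(theta, 8, 2),     type = "l", main = "Biased towards heads")
par(mfrow = c(1, 1))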

The effect of the prior

I stated above that one of the benefits of the prior is that it allows protection (somewhat) from spurious findings. If I have a really strong prior belief that the coin is fair, 9/10 heads isn’t going to be all that convincing evidence that it is not fair. However, if I have a weak prior that the coin is fair, then I will be quite convinced by the data.

This is illustrated below. Both priors below reflect the belief that the coin is fair; what differs between the two is the strength in this belief. The prior on the left is quite a weak belief, as the distribution (although peaked at 0.5) is quite spread out. The prior on the right is a stronger belief that the coin is fair.

In both cases, the likelihood is the result of observing 9/10 heads.

[Figure: a weak and a strong "fair coin" prior, the likelihood for 9/10 heads, and the resulting posteriors]

You can see that when the prior is a weak belief, the posterior is very similar to the likelihood; that is, the posterior belief is almost entirely dictated by the data. However, when we have a strong prior belief, our beliefs are not altered much by observing just 9/10 heads.
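A minimal conjugate sketch of this comparison, with guessed prior strengths: for a beta(a, b) prior and k heads in n flips, the posterior is beta(a + k, b + n - k).

# weak vs. strong "fair coin" priors, updated with 9 heads in 10 flips
theta <- seq(0, 1, length.out = 200)
heads <- 9
flips <- 10

par(mfrow = c(1, 2))
for (a in c(4, 50)) {  # a = 4: weak beta(4, 4) prior; a = 50: strong beta(50, 50) prior
  prior     <- dbeta(theta, a, a)
  posterior <- dbeta(theta, a + heads, a + flips - heads)
  plot(theta, posterior, type = "l", lwd = 2, col = "blue",
       xlab = "theta", ylab = "Density",
       main = paste0("beta(", a, ", ", a, ") prior"))
  lines(theta, prior, lwd = 2, col = "grey")  # prior shown for comparison
}
par(mfrow = c(1, 1))

Re-running the same code with heads <- 900 and flips <- 1000 reproduces the point made below: with enough data, even the strong prior is dragged towards the likelihood.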

Now, I imagine that this is the anti-Bayesian’s point: “Even with clear data you haven’t changed your mind.” True. Is this a negative? Well, imagine instead this study was assessing the existence of UFOs rather than simple coin flips. If I showed you 9 YouTube videos of UFO “evidence”, and 1 video showing little (if any) evidence, would you be convinced of UFOs? I doubt it. You were the right-hand plot in this case. (I know, I know, the theta distribution doesn’t make sense in this case, but ignore that!)

What if the prior is wrong?

Worried that your prior is wrong[3], or that you cannot justify it completely? Throw more data at it. (When is this ever a bad idea?) Below are the same priors, but now we flip the coin 1,000 times and observe 900 heads. (Note that the proportion of heads is the same as in the previous example.) Now, even our strong prior belief has to be updated considerably based on this data. With more data, even mis-specified priors do not affect inference much.

[Figure: the same weak and strong priors, now updated with 900 heads in 1,000 flips]

To get an idea of how sample size influences the effect of the prior on the posterior, I created the gif animation below. In it, we have a relatively strong (although not insanely so) prior belief that the coin is biased towards "heads". Then, we start flipping the coin, and update the posterior after each flip. In fact, this coin is fair, so our prior is not in accord with (unobservable) "reality". As the number of flips increases, though, our posterior starts to match the likelihood of the data. So, "wrong" priors aren't really a problem. Just throw more data at it.

[Animation: the posterior after each flip of a fair coin, starting from a "biased heads" prior]

“Today’s posterior is tomorrow’s prior” — Lindley (1970)

After collecting some data and updating your prior, you now have a posterior belief of something. If you wish to collect more data, you do not use your original prior (because it no longer reflects your belief), but you instead use the posterior from your previous study as the prior for your current one. Then, you collect some data, update your priors into your posteriors…and so on.

In this sense, Bayesian analysis is ultimately “self-correcting”: as you collect more and more data, even horrendously-specified priors won’t matter.

In the example below, we have a fairly-loose idea that the coin is fair—i.e., theta = 0.5. We flip a coin 20 times, and observe 18 heads. Then we update to our posterior, which suggests the true value for theta is about 0.7 ish. But then we wish to run a second “study”; we use the posterior from study 1 as our prior for study 2. We again observe 18 heads out of 20 flips, and update accordingly.

[Figure: prior, likelihood, and posterior for study 1, and for study 2 using study 1's posterior as its prior]
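Because the beta prior is conjugate to the binomial likelihood, "posterior as prior" simply amounts to carrying the counts forward. A sketch, with an illustrative (guessed) starting prior:

# "today's posterior is tomorrow's prior" with beta-binomial updating
a <- 10; b <- 10                      # fairly loose "fair coin" prior
heads <- 18; tails <- 2               # each "study" observes 18/20 heads

a1 <- a  + heads; b1 <- b  + tails    # posterior after study 1
a2 <- a1 + heads; b2 <- b1 + tails    # study 2 uses study 1's posterior as its prior

c(study1 = a1 / (a1 + b1),            # ~0.70
  study2 = a2 / (a2 + b2))            # ~0.77: pulled further towards the data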

Conclusion

One of the nicest things about Bayesian analysis is that the way our beliefs should be updated in the face of incoming data is clearly (and logically) specified. Many people's concerns surround the prior. I hope I have shed some light on why I do not consider this to be a problem. Even though the prior isn't something that needs to be "overcome" with lots of data, it is reassuring for the anti-Bayesian to know that, with sufficient data, it doesn't really matter much.

So, stop whining about Bayesian analysis, and go collect more data. Always, more data.

Click here for the R code for this post


  1. Occasionally (although this is happening more and more) I get a slow, wise, agreeing nod. I like those. 
  2. I really wanted to avoid the term "we", as it implies I am part of the "in-group": those experts of Bayesian analysis who truly appreciate all of its beauty, and are able to apply it to all of their experimental data. I most certainly do not fall into this camp; but I am trying. 
  3. Technically, it cannot be "wrong" because it is your belief. If that belief is justifiable, then it's fair game. You may have to update your prior, though, if considerable data contradict it. But, bear with me. 

Animating Robustness-Check of Bayes Factor Priors

Today I submitted a paper which contained some Bayesian analysis using Bayes factors (a default Bayesian t-test). As the test used a default prior on the effect size (a Cauchy distribution centred on zero with rate [r] = 0.707), I wanted to appease any reviewer concern that the final Bayes factor reported was some peculiarity of the prior rather than due to the data.

So, I performed a standard robustness check on the prior, where I plot the resulting Bayes factor across a wide range of prior values (in this case, a wide range of Cauchy rate parameters). The plot below shows the result:

[Figure: Bayes factor as a function of the Cauchy prior width, with the default prior (r = 0.707) shown as a black dot]

The plot shows the Bayes factor with the default prior as the black dot; the line represents how the Bayes factor changes as a function of the Cauchy prior width. This analysis suggests the final conclusion is not a peculiarity of the prior, and as such the reported Bayes factor is quite robust.

What this plot doesn’t show, though, is what the prior distribution looks like as the rate (r) changes. Ideally, I wanted to show what the Cauchy distribution looks like across the entire range of r values explored in the plot above; this helps me (and I guess other people) understand how the prior is influencing the Bayes factor. Visualisation is very important sometimes.

So, I knocked up some R code which plots the robustness check above, but plots next to it the density of the Cauchy distribution with the rate parameter currently being explored. The animation cycles through increasing values for rate r, and shows the density for this rate on the left, and the Bayes factor for my data with this prior on the right. For me, this animation provides a nice insight to how the Bayes factor is being influenced by the prior.

[Animation: the Cauchy prior density for the current rate r (left) and the Bayes factor for that prior width (right), cycling through increasing values of r]

R Code

The code for generating the individual plots is below. Note that all of these plots then need to be fed into your favourite gif-making device! (I used http://gifmaker.me/).
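The original analysis used my own data, which aren't included here, so the sketch below simulates placeholder data and assumes the BayesFactor package (ttestBF() and its rscale argument) as the Bayes-factor engine. It writes one PNG frame per rate value, ready for the gif-maker.

# robustness-check animation frames: Cauchy prior (left), Bayes factor (right)
library(BayesFactor)

set.seed(42)
dat <- rnorm(30, 0.5, 1)                 # hypothetical one-sample data

rs  <- seq(0.1, 1.5, by = 0.1)           # range of Cauchy rate (scale) parameters
bfs <- sapply(rs, function(r) extractBF(ttestBF(dat, rscale = r))$bf)

delta <- seq(-3, 3, length.out = 200)    # effect-size axis for the prior density

for (i in seq_along(rs)) {
  png(sprintf("frame_%02d.png", i), width = 800, height = 400)
  par(mfrow = c(1, 2))
  # left panel: the Cauchy prior on effect size for the current rate
  plot(delta, dcauchy(delta, location = 0, scale = rs[i]), type = "l", lwd = 2,
       xlab = "Effect size", ylab = "Density",
       main = paste("Cauchy prior, r =", rs[i]))
  # right panel: Bayes factor as a function of prior width, current value marked
  plot(rs, log(bfs), type = "l", lwd = 2,
       xlab = "Cauchy rate (r)", ylab = "log(BF10)")
  points(rs[i], log(bfs[i]), pch = 19, col = "red")
  dev.off()
}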

Recruitment Order & Sequential Bayes Factors

One advantage of using Bayes factors (or any other form of Bayesian analysis) is the ability to engage in optional stopping. That is, one can collect some data, perform the critical Bayesian test, and stop data collection once a pre-defined criterion has been obtained (e.g., until “strong” evidence has been found in support of one hypothesis over another). If the criterion has not been met, one can resume data collection until it has.

Optional stopping is problematic when using p-values, but is "No problem for Bayesians!" (Rouder, 2014; see also this blog post from Kruschke). A recent paper in press at Psychological Methods (Schönbrodt et al., in press) shows you how to use optional stopping with Bayes factors—so-called "sequential Bayes factors" (SBF).

The figure below shows a recent set of data from my lab where we used SBF for optional stopping. Following the advice of the SBF paper, we set our stopping criteria as N > 20 and BF10 > 6 (evidence in favour of the alternative) or BF10 < 1/6 (evidence in favour of the null); these criteria are shown as horizontal dashed lines. The circles represent the progression of the Bayes factor (shown on a log scale) as the sample size increased. We stopped at 76 participants, with "strong" evidence supporting the alternative hypothesis.

[Figure: sequential Bayes factor for the experiment, crossing the BF10 > 6 criterion at N = 76]

One thing troubled me when I was writing these results up: I could well imagine a sceptical reviewer noticing that the Bayes factor seemed to be heading in support of the null between sample sizes of 30 and 40, although the criterion for stopping was not reached. A reviewer might wonder, therefore, whether, had some of the later participants who showed no effect (e.g., participants 54–59) been recruited earlier, we would have stopped data collection and found evidence in favour of the null. Put another way, how robust was our final result to the order in which our sample was recruited to the experiment?

Assessing the Effect of Recruitment Order

I was not aware of any formal way to assess this question, so for the purposes of the paper I was writing I knocked up a quick simulation. I randomised the recruitment order of our sample and plotted the sequential Bayes factor for this "new" recruitment order. I did this 50 times, plotting the trajectory of the SBF each time. I reasoned that if our conclusion was robust against recruitment order, the stopping rule in favour of the null should rarely (if ever) be met. The results of this simulation are below. As can be seen, the stopping criterion for the null was never met, suggesting our results were robust to recruitment order.

[Figure: sequential Bayes factors for 50 random recruitment orders of the observed sample]
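A sketch of this kind of permutation check. The real data and design aren't included here, so it assumes a one-sample design with simulated placeholder scores and uses the BayesFactor package with its default prior; the loop recomputes the sequential Bayes factor for 50 random recruitment orders.

# sequential Bayes factors under 50 random recruitment orders
library(BayesFactor)

set.seed(1)
scores <- rnorm(76, 0.3, 1)   # hypothetical per-participant effect scores
min_n  <- 20
ns     <- min_n:length(scores)

# one column of log Bayes factors per shuffled recruitment order (takes a while)
trajectories <- replicate(50, {
  shuffled <- sample(scores)
  sapply(ns, function(n) log(extractBF(ttestBF(shuffled[1:n]))$bf))
})

matplot(ns, trajectories, type = "l", lty = 1, col = rgb(0, 0, 0, 0.2),
        xlab = "Sample size", ylab = "log(BF10)")
abline(h = log(c(6, 1/6)), lty = 2)   # stopping criteria for alternative / null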

 

Recruitment Order Simulation

Then, I became curious: How robust are SBFs to recruitment order when we know what the true effect is? I simulated a set of 100 participants for an effect known to be small (d = 0.3), and plotted the SBF as "participants" were recruited. I tried different random seeds until I found a pattern I was interested in: I wanted a set of data where the SBF comes very close to the stopping rule in favour of the null, even though we know the alternative is true. It took me a few tries, but eventually I got the plot below.

[Figure: sequential Bayes factor for the simulated data (d = 0.3), dipping close to the null criterion before ending in support of the alternative]

Note that the dip toward the null criterion is even more extreme than in my experiment. Thus, this set of data seemed perfect for testing my method for assessing the robustness of the final conclusion ("strong evidence for the alternative") against recruitment order. I followed the same procedure as before, simulating 50 random recruitment orders, and plotted the results below.

[Figure: sequential Bayes factors for 50 random recruitment orders of the simulated data]

The null criterion is met in only one simulated "experiment". This is pleasing, at least for the paper I am due to submit soon: the SBF (at least in these two small examples) seems robust against recruitment order.

R Code

If you would like to see the R code for the above plots, then you can download the html file for this post which includes R code.

Polls: Why sample size matters

The people of Scotland will decide this Thursday, in a "Yes-No" referendum, whether to leave the United Kingdom and become an independent nation. As is typical around times of heightened political interest, media outlets are awash with polls attempting to predict the outcome. Recent polls have put the "No" response at about 52%, although this lead is reported to be shrinking.

What frustrates me about the reporting of these polls is that quite often—always?—the media do not tell us the sample size of the poll. It's all well and good stating that the "No" side is at 52%, but I don't care about the point estimate (i.e., 52%); I care about how much error there is in this estimate, and, as a consequence, how much confidence I should have in it. Large error means the point estimate is next to useless; small error means it is more believable. The error of the estimate tends to decrease as sample size increases, so without this information I cannot assess how much error there is likely to be in the estimate.

Sample Statistics & Credible Intervals

In statistics, we are interested in what's called a population parameter, which (roughly) corresponds to the "true" value of the quantity of interest if we were able to sample the whole population. This is certainly what pollsters are interested in: they want to predict the outcome of the referendum.

Of course, this is never possible, so we take a sample from the population and, on the basis of the results obtained from this sample, estimate the true population parameter. Thus, our interest is in the population parameter, and we only use the sample statistic as a proxy to estimate what really interests us. The sample statistic is our point estimate, and we hope that it is representative of the true population parameter.

It’s very unlikely that the sample estimate exactly matches the population parameter, so ideally we would like to see a range of credible estimates of the true population parameter (a credible interval), rather than just a point estimate. For example, how confident can we be that the “true” result will be 52% given our sample responded 52%? How much more likely is 51% to be a true estimate? What about 49% (which would completely change the result of the referendum)? It’s impossible to tell from a point estimate. 

I wanted to demonstrate how the size of the sample used can influence our confidence in the point estimates presented in these polls. To do this, I used R and Bayesian estimation to run some quick simulations of Yes/No polls whilst varying the total number of “people” asked. What I was interested in was how the confidence in the estimation of the population parameter varied as sample size increased. 

The results are shown below. If you would like more detail about the simulation (and Bayesian estimation techniques), please see the end of the post. Disclaimer: I am no Bayesian expert, so if there are errors below I would love to hear about them. The message remains the same, though, even if the details might differ slightly.

Results

The results for each sample size are shown in the figure below. The sample size is shown at the top of each plot ("N = sample size").

[Figure: posterior densities for the proportion responding "No", for sample sizes from 50 to 2000, each with its 95% HDI]

 

These plots show posterior probability density functions over all possible parameter values (the proportion responding "No", on the x-axis). They show how we should assign credibility across the parameter space, given our prior beliefs and our data (see the simulation details below). Parameter values that fall within the bulk of the distribution are far more likely to be the "true" parameter than values that fall outside it. Thus, Bayesian analysis provides a way to assess the range of credible parameter values by finding the narrowest span of values over which 95% of the density falls. This is shown as a red line in each of the above plots, and is known as the 95% Highest-Density Interval (HDI).

Values that fall within the HDI are more credible estimates of the true population parameter than those values that fall outside of it. Narrow HDIs suggest we have been able to provide a very precise estimate of the true parameter; wide HDIs reflect uncertainty in our estimates of the true parameter. 

Remember, the value I used to simulate the data was a proportion of "No" responses of .52 (52%). We see that, for all sample sizes, the peak of the distribution sits roughly above this value. Thus, the "point estimate" from each of these plots would coincide with the pollsters' 52%.

But look at the width of the density function (and the red HDI) as sample size changes. With only 50 people polled, the HDI runs from 0.38 to 0.68 (see the top of the figure for the numbers). This means that our credible estimates of the percentage of people saying "No" range from 38% (a landslide for the Yes campaign) to 68% (a landslide for the No campaign). I'm not sure about you, but I'm not very confident that 52% is a good estimate here, as there is such wild uncertainty.

By contrast, with 2000 respondents the HDI is much narrower, reflecting a far more precise estimate of the population parameter; but even then, there is a 4-point window of credible outcomes. That is, the true response could be anywhere from 50% (evenly split) to 54% (a clear edge for the No campaign).

Take-Home Message

Most, if not all, of the polls being conducted use sufficiently large samples. What gets my goat, though, is not the polls but how they are reported. And this is not just true for this campaign. How many times have you seen an advert stating "…58% of people agree…"? My response is always the same: without information on the sample size, I just don't trust the percentage.

Simulation Overview

If you would like the R-code I used, please shoot me an email! It uses the scripts presented in Lee & Wagenmakers’ (2014) SUPERB book “Bayesian Cognitive Modelling”. 

  1. I am interested in estimating the proportion of people responding “No”; I fixed this at the most recent estimate of 52%. Therefore, 52% is the “true” population parameter we are trying to estimate, and I simulated data using this percentage.
  2. I varied the sample size of the simulated poll from 50 (ridiculously small) up to 2000 (pretty large) in various steps. For each sample size, I fixed it so that 52% responded "No". 
  3. I used Bayesian parameter estimation to determine—for each sample size—our best estimate of the population parameter. Bayesian estimation provides a complete probability distribution over the entire parameter space (called the posterior distribution), which shows how we should assign credibility across all of the possible parameter values, given our prior expectations and given the data. Wide posterior distributions suggest a large number of possible parameters are candidates to be the "true" parameter, whereas narrow posterior distributions suggest we have narrowed down the possibilities considerably. 
  4. The prior is how much probability we assign to each parameter value before we see any data. In our example, we need to provide a probability distribution over all possible percentages of "No" responses, based on our expectations. For the simulation, I set the prior to be a uniform distribution over all parameter values (a beta[1,1], for those interested); that is, the prior was very neutral in terms of our expectations. 
  5. Then, I set the responses to 52% of the sample size. This is our data. I modelled responding as a binomial process, with k = the number responding "No" and n = the total sample size. 
  6. Based on our prior and data, I used rjags to conduct the analysis, which produces a probability density function across all possible parameter values for each sample size. This is shown in the figure presented above. (A minimal conjugate sketch of the same analysis appears below.)
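Because the beta(1,1) prior is conjugate to the binomial likelihood, the posterior for each simulated poll is simply beta(1 + k, 1 + n - k). The sketch below exploits this shortcut rather than calling JAGS (so it is not the code described above, only an approximation of it), and uses coda's HPDinterval() for the 95% HDI.

# HDI width for a "52% No" poll at different sample sizes, via the conjugate posterior
library(coda)

set.seed(42)
sample_sizes <- c(50, 100, 250, 500, 1000, 2000)

for (n in sample_sizes) {
  k <- round(0.52 * n)                           # number responding "No"
  posterior <- rbeta(100000, 1 + k, 1 + n - k)   # samples from the beta posterior
  hdi <- HPDinterval(as.mcmc(posterior))
  cat(sprintf("N = %4d: 95%% HDI = [%.3f, %.3f]\n", n, hdi[1], hdi[2]))
}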