# Polls: Why sample size matters

The people of Scotland will decide in a “Yes/No” referendum this Thursday whether to leave the United Kingdom and become an independent nation. As is typical around times of heightened political interest, media outlets are awash with polls attempting to predict the outcome. Recent polls have put the “No” response at about 52%, although this lead is reported to be narrowing.

What frustrates me about the reporting of these polls is that quite often—always?—the media do not report the sample size of the poll. It’s all well and good stating that the “No” side stands at 52%, but I don’t care about the point estimate (i.e., 52%); I care about how much error there is in this estimate, and—as a consequence—how much confidence I should place in it. Large error means the point estimate is next to useless; small error means it is likely more believable. The error of the estimate tends to shrink as sample size increases, so without this information I cannot assess how much error there is likely to be in the estimate.

## Sample Statistics & Credible Intervals

In statistics, we are interested in what’s called a population parameter, which (roughly) corresponds to the “true” value of a quantity of interest if we were able to sample the whole population. This is certainly what pollsters are after: they want to predict the outcome of the referendum across the whole electorate.

Of course, sampling the whole population is rarely possible, so we take samples from it and, on the basis of the results obtained from the sample, estimate the true population parameter. Thus, our interest is in the population parameter, and we use the sample statistic only as a proxy for what really interests us. The sample statistic is our point estimate from our sample, and we hope it is representative of the true population parameter.

It’s very unlikely that the sample estimate exactly matches the population parameter, so ideally we would like a range of credible estimates of the true population parameter (a credible interval), rather than just a point estimate. For example, how confident can we be that the “true” result is 52%, given that our sample responded 52%? How plausible is 51%? What about 49% (which would completely change the result of the referendum)? It’s impossible to tell from a point estimate alone.
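To make this concrete, here is a quick sketch (in Python, rather than the R used later in the post) of the kind of question a full posterior distribution lets you ask. The 1000-respondent poll and the flat Beta(1,1) prior here are my assumptions for illustration:

```python
from scipy.stats import beta

# Hypothetical poll (my numbers, for illustration): 520 of 1000
# respondents say "No", i.e. a 52% point estimate.
n, k = 1000, 520

# With a flat Beta(1, 1) prior, the posterior over the true "No"
# proportion is Beta(1 + k, 1 + n - k) (Beta-binomial conjugacy).
posterior = beta(1 + k, 1 + n - k)

# Probability that the true proportion is actually below 50% --
# i.e. that "Yes" wins despite the 52% point estimate.
p_yes_wins = posterior.cdf(0.50)
print(round(p_yes_wins, 3))
```

Even with 1000 respondents, a non-trivial slice of the posterior sits below 50%, which is exactly the kind of information a bare point estimate hides.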

I wanted to demonstrate how the size of the sample used can influence our confidence in the point estimates presented in these polls. To do this, I used R and Bayesian estimation to run some quick simulations of Yes/No polls whilst varying the total number of “people” asked. What I was interested in was how the confidence in the estimation of the population parameter varied as sample size increased.

The results are shown below. If you would like more detail about the simulation (and Bayesian estimation techniques), please see the end of the post. Disclaimer: I am no Bayesian expert, so if there are errors below I would love to hear about them. The message remains the same, though, even if the details might differ slightly.

## Results

The results for each sample size are shown in the Figure below. The sample size is shown at the top of each plot (“N = sample size”). These plots show probability density functions over all possible parameter values (Proportion N[o], on the x-axis). Each plot shows how we should assign credibility across the parameter space, given our prior beliefs and our data (see the simulation details for these). Parameter values that fall within the “bulk” of the distribution are far more likely to be the “true” parameter than those that fall outside it. Thus, Bayesian analysis lets us assess the range of credible parameters by finding the values over which 95% of the density falls. This is shown as a red line in each plot, and is known as the 95% Highest-Density Interval (HDI).

Values that fall within the HDI are more credible estimates of the true population parameter than those values that fall outside of it. Narrow HDIs suggest we have been able to provide a very precise estimate of the true parameter; wide HDIs reflect uncertainty in our estimates of the true parameter.
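The post computes these HDIs from rjags posterior samples; since the model is conjugate (a flat Beta(1,1) prior plus binomial data, per the simulation overview), the same interval can be sketched numerically. This Python illustration scans candidate 95% intervals of a Beta posterior and keeps the narrowest one:

```python
import numpy as np
from scipy.stats import beta

def hdi_95(dist, grid=10_000):
    """Approximate the 95% highest-density interval: among all intervals
    containing 95% of the probability mass, return the narrowest."""
    lower_tail = np.linspace(0, 0.05, grid)   # mass left of the interval
    lo = dist.ppf(lower_tail)
    hi = dist.ppf(lower_tail + 0.95)
    best = np.argmin(hi - lo)                 # narrowest 95% interval
    return lo[best], hi[best]

# Posterior for 26 "No" answers out of 50 (52%), flat Beta(1, 1) prior.
posterior = beta(1 + 26, 1 + 50 - 26)
print(hdi_95(posterior))
```

For a near-symmetric posterior like this one the HDI is almost identical to the central 95% interval, but for skewed posteriors the two can differ noticeably.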

Remember, the value I used to simulate these data was Proportion N[o] = .52 (52%). We see that, for all sample sizes, the peak (mode) of the distribution sits roughly above this figure. Thus, the “point estimate” for most of these plots would coincide with the pollsters’ 52%.

But look at the width of the density function (and the red HDI) as sample size increases. With only 50 people polled, the HDI runs from 0.38 to 0.65 (see the top of each plot for the numbers). This means that our credible estimates of the percentage of people saying “No” range from 38% (a landslide for the Yes campaign) to 65% (a landslide for the No campaign). I’m not sure about you, but I’m not very confident that 52% is a good estimate here, when there is such wild uncertainty.

By contrast, with 2000 respondents the HDI is much narrower, reflecting a far more precise estimate of the population parameter; but even then, there is a 4-point window of credible outcomes. That is, the true response could be anywhere from 50% (a dead heat) to 54% (a clear edge to the No campaign).
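This shrinking-interval pattern can be reproduced without MCMC at all. Under the flat Beta(1,1) prior and binomial model described in the simulation overview, the posterior is a Beta distribution, so a central 95% credible interval (nearly identical to the HDI for these near-symmetric posteriors) can be printed for each sample size. This is a Python sketch, not the post’s rjags code:

```python
from scipy.stats import beta

# Width of the central 95% credible interval for a fixed 52% "No" share,
# flat Beta(1, 1) prior, at several sample sizes.
for n in (50, 200, 800, 2000):
    k = round(0.52 * n)                          # 52% respond "No"
    post = beta(1 + k, 1 + n - k)
    lo, hi = post.ppf(0.025), post.ppf(0.975)
    print(f"N = {n:4d}: [{lo:.3f}, {hi:.3f}]  width = {hi - lo:.3f}")
```

The interval width falls roughly with the square root of the sample size: quadrupling N only halves the width, which is why very precise polls are expensive.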

## Take-Home Message

Most, if not all, of the polls being conducted use adequate sample sizes. What gets my goat is not the polls themselves, but how they are reported. And this is not just true of this campaign. How many times have you seen an advert stating “…58% of people agree…”? My response is always the same: without information on the sample size, I just don’t trust the percentage.

## Simulation Overview

If you would like the R code I used, please shoot me an email! It uses the scripts presented in Lee & Wagenmakers’ (2014) superb book “Bayesian Cognitive Modeling”.

1. I am interested in estimating the proportion of people responding “No”; I fixed this at the most recent estimate of 52%. Therefore, 52% is the “true” population parameter we are trying to estimate, and I simulated data using this percentage.
2. I varied the sample size of the simulated poll from 50 (ridiculously small) up to 2000 (pretty large) in several steps. For each simulated sample, I fixed the responses so that 52% said “No”.
3. I used Bayesian parameter estimation to determine—for each sample size—our best estimate of the population parameter. Bayesian estimation provides a complete probability distribution over the entire parameter space (called the posterior distribution), which shows how we should assign credibility across all possible parameter values, given our prior expectations and given the data. Wide posterior distributions suggest a large number of parameter values are candidates to be the “true” parameter, whereas narrow posterior distributions suggest we have narrowed down the possibilities considerably.
4. The prior reflects how much probability we assign to each parameter value before seeing any data. In our example, we need to provide a probability distribution over all possible percentages of responding “No”, based on our expectations. For the simulation, I set the prior to be a uniform distribution over all parameters (a beta[1,1], for those interested); that is, the prior was neutral with respect to our expectations.
5. Then I set the responses to 52% of the sample size. This is our data. I modelled responding as a binomial process, with k = the number responding “No” and n = the total sample size.
6. Based on the prior and the data, I used rjags to conduct the analysis, which produces a probability density function over all possible parameter values for each sample size. These are the densities shown in the Figure presented above.
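The six steps above can be sketched end to end with a simple grid approximation (a Python stand-in for the rjags analysis; the variable names and grid resolution are mine):

```python
import numpy as np
from scipy.stats import binom

theta = np.linspace(0, 1, 1001)            # grid of candidate "No" proportions
prior = np.ones_like(theta)                # step 4: flat (beta[1,1]) prior

n = 500                                    # step 2: one simulated sample size
k = round(0.52 * n)                        # steps 1 & 5: 52% respond "No"

likelihood = binom.pmf(k, n, theta)        # step 5: binomial model of responses
posterior = prior * likelihood             # Bayes' rule (unnormalised)
posterior /= posterior.sum() * (theta[1] - theta[0])   # normalise the density

print(theta[np.argmax(posterior)])         # the posterior peaks near 0.52
```

rjags draws samples from this same posterior rather than evaluating it on a grid, but for a one-parameter model like this, the grid version shows the shape of the density exactly.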