Month: September 2014

Polls: Why sample size matters

The people of Scotland are currently deciding in a “Yes-No” referendum (this Thursday!) whether to leave the United Kingdom and become an independent nation. As is typical around times of heightened political interest, media outlets are awash with polls attempting to predict the outcome. Recent polls have put the “No” response at about 52%, although this share is reported to be falling.

What frustrates me about the reporting of these polls is that quite often—always?—the media do not tell us the sample size of the poll. It’s all well and good stating that the “No” side is on 52%, but I don’t care about the point estimate (i.e., 52%); I care about how much error there is in this estimate and, as a consequence, how much confidence I should place in it. Large error means the point estimate is next to useless; small error means it is likely more believable. The error of the estimate tends to shrink as sample size increases, so without this information I cannot assess how much error there is likely to be in the estimate.

Sample Statistics & Credible Intervals

In statistics, we are interested in what’s called a population parameter, which (roughly) corresponds to the “true” value of the quantity of interest if we were able to sample the whole population. This is certainly what pollsters are interested in: they want to predict the outcome of the referendum.

Of course, this is never possible, so we take a sample from the population and, on the basis of the results obtained from this sample, estimate the true population parameter. Thus, our interest is in the population parameter, and we only use the sample statistic as a proxy for what really interests us. The sample statistic is our point estimate from our sample, and we hope that it is representative of the true population parameter.

It’s very unlikely that the sample estimate exactly matches the population parameter, so ideally we would like to see a range of credible estimates of the true population parameter (a credible interval), rather than just a point estimate. For example, how confident can we be that the “true” result will be 52%, given that 52% of our sample responded “No”? How much more credible is 51% as an estimate? What about 49% (which would completely change the result of the referendum)? It’s impossible to tell from a point estimate alone.

I wanted to demonstrate how the size of the sample used can influence our confidence in the point estimates presented in these polls. To do this, I used R and Bayesian estimation to run some quick simulations of Yes/No polls whilst varying the total number of “people” asked. What I was interested in was how the confidence in the estimation of the population parameter varied as sample size increased. 

The results are shown below. If you would like more detail about the simulation (and Bayesian estimation techniques), please see the end of the post. Disclaimer: I am no Bayesian expert, so if there are errors below I would love to hear about them. The message remains the same, though, even if the details might differ slightly.

Results

The results for each sample size are shown in the Figure below. The sample size is shown at the top of each plot (“N = sample size”).


These plots show probability density functions over all possible parameter values (the proportion responding “No”, on the x-axis). Each shows how we should assign credibility across the parameter space, given our prior beliefs and our data (see the simulation details below). Parameter values that fall within the “bulk” of the distribution are far more likely to be the “true” parameter than values that fall outside it. Thus, Bayesian analysis provides us with a way to assess the range of credible parameter values, by finding the interval over which 95% of the density falls. This is shown as a red line in each of the above plots, and is known as the 95% Highest-Density Interval (HDI).

Values that fall within the HDI are more credible estimates of the true population parameter than those values that fall outside of it. Narrow HDIs suggest we have been able to provide a very precise estimate of the true parameter; wide HDIs reflect uncertainty in our estimates of the true parameter. 
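If you want to compute an HDI yourself from posterior samples, here is a minimal sketch of the usual approach: take the narrowest interval that contains 95% of the draws. The beta example at the end assumes the uniform prior and the 26-out-of-50 “No” count used in the simulation described later in the post.

    # Minimal sketch: 95% HDI as the narrowest interval containing 95% of
    # the posterior samples. `samples` is a numeric vector of posterior draws.
    hdi <- function(samples, prob = 0.95) {
      sorted <- sort(samples)
      n      <- length(sorted)
      width  <- floor(prob * n)           # number of draws the interval must cover
      lows   <- sorted[1:(n - width)]     # candidate lower bounds
      highs  <- sorted[(1 + width):n]     # matching upper bounds
      best   <- which.min(highs - lows)   # pick the narrowest candidate
      c(lower = lows[best], upper = highs[best])
    }

    # Example with a known posterior: 26 "No" responses out of 50, uniform prior
    hdi(rbeta(100000, 1 + 26, 1 + 24))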

Remember, the value I used to simulate these data was a “No” proportion of .52 (52%). We see that, for all sample sizes, the peak of the distribution sits roughly over this value. Thus, the “point estimate” from each of these plots would coincide with the pollsters’ figure of 52%.

But look at the width of the density function (and of the red HDI) as sample size increases. With only 50 people polled, the HDI runs from 0.38 to 0.65 (see the top of the figure for the numbers). This means that the credible estimates of the percentage of people saying “No” range from 38% (a landslide for the Yes campaign) to 65% (a landslide for the No campaign). I’m not sure about you, but I’m not very confident that 52% is a good estimate here, given such wild uncertainty.

By contrast, with 2000 respondents the HDI is much narrower (because the density is much narrower); this reflects a far more precise estimate of the population parameter. But even then, there is roughly a 4-percentage-point window of credible outcomes. That is, the true proportion responding “No” could be anywhere from 50% (too close to call) to 54% (a clear edge to the No campaign).
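Because the prior in my simulation is uniform, you can sanity-check these widths analytically: the posterior is simply a beta distribution. A rough sketch follows (this gives an equal-tailed 95% interval, which for these nearly symmetric posteriors is almost identical to the HDI):

    # With a uniform (beta[1,1]) prior and k "No" responses out of n,
    # the posterior is beta(1 + k, 1 + n - k).
    qbeta(c(0.025, 0.975), 1 + 26,   1 + 24)    # n = 50:   roughly 0.38 to 0.65
    qbeta(c(0.025, 0.975), 1 + 1040, 1 + 960)   # n = 2000: roughly 0.50 to 0.54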

Take-Home Message

Most, if not all, of the polls being conducted have adequate sample sizes. What gets my goat, though, is not the polls but how they are reported. And this is not just true for this campaign. How many times have you seen an advert stating “…58% of people agree…”? My response is always the same: without information on the sample size, I just don’t trust the percentage.

Simulation Overview

If you would like the R code I used, please shoot me an email! It uses the scripts presented in Lee & Wagenmakers’ (2014) SUPERB book “Bayesian Cognitive Modeling”.

  1. I am interested in estimating the proportion of people responding “No”; I fixed this at the most recent estimate of 52%. Therefore, 52% is the “true” population parameter we are trying to estimate, and I simulated data using this percentage.
  2. I varied the sample size of the simulated poll from 50 (ridiculously small) up to 2000 (pretty large) in various steps. In each simulated sample size, I fixed it so that 52% responded “No”.  
  3. I used Bayesian parameter estimation to determine, for each sample size, our best estimate of the population parameter. Bayesian estimation provides a complete probability distribution over the entire parameter space (called the posterior distribution), which shows how we should assign credibility across all of the possible parameter values, given our prior expectations and given the data. Wide posterior distributions suggest a large number of parameter values remain candidates to be the “true” parameter, whereas narrow posterior distributions suggest we have narrowed down the possibilities considerably.
  4. The prior is how much probability we assign to each parameter value before we see any data. In our example, we need to provide a probability distribution over all possible percentages of responding “No”, based on our expectations. For the simulation, I set the prior to be a uniform distribution over all parameter values (a beta[1,1], for those interested); that is, the prior was neutral in terms of our expectations.
  5. Then, I set the responses to 52% of the sample size. This is our data. I modelled responding as a binomial process, with k = the number responding “No” and n = the total sample size.
  6. Based on our prior and data, I used rjags to conduct the analysis, which produces a probability density function over all possible parameter values for each sample size. These are the distributions shown in the Figure presented earlier. (A stripped-down sketch of this kind of model is given below.)
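For the curious, here is a minimal sketch of the kind of model this involves, using rjags. The model follows the standard rate-estimation example from Lee & Wagenmakers; the specific numbers and variable names below are illustrative rather than my original script.

    library(rjags)

    # Minimal sketch of the rate-estimation model: uniform beta[1,1] prior,
    # binomial likelihood (in the style of Lee & Wagenmakers, 2014).
    model_string <- "
    model {
      theta ~ dbeta(1, 1)   # uniform prior over the proportion responding No
      k ~ dbin(theta, n)    # observed No responses out of n respondents
    }"

    n <- 50                 # sample size of the simulated poll
    k <- round(0.52 * n)    # 52% of the sample respond No

    jags <- jags.model(textConnection(model_string),
                       data = list(k = k, n = n),
                       n.chains = 3)
    samples <- coda.samples(jags, variable.names = "theta", n.iter = 10000)

    summary(samples)        # posterior summary for theta

From the resulting samples you can plot the posterior density and compute the 95% HDI as described above.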

Should I Switch to R?

R is a programming language primarily geared for statistical computing. Within psychology, it is fast becoming SPSS’s main competitor when it comes to conducting analysis. I have been using R for most of my work over the past 18 months, and I absolutely love it; it was not an exaggeration when I said in a previous post that R is my favourite thing EVER (OK, it’s top among work-related things, at least). I am a true R-convert, and I preach its existence to anyone who will listen. Halleluj-R!

This week on Twitter, someone asked others to list advantages of using R over SPSS in a teaching situation. Although I don’t use R for teaching (more on this below), it forced me to reflect on WHY I love R so much. So, I decided to list some of the core advantages I see R as having (in no particular order). In the spirit of fairness, I also reflected on some key disadvantages of using R.

I hope others find this of use before deciding whether to plunge into the R-world. I say dive right in.

Advantages

1. It’s Free. R is an open-source venture, so EVERYTHING you need in R is free. Yup; FREE. To me, this is so important because it allows the skills you develop whilst learning R to travel with you regardless of where your next job is. Imagine only knowing SPSS, but then moving to an institution that doesn’t have an SPSS licence—what do you do? Are you going to fork out for an individual licence fee yourself (which, by the way, will expire after a measly 12 months)?

2. Reproducible Analysis. It is very important that you be able to reproduce your analysis EXACTLY. It is embarrassing how many times in the past I have failed to reproduce the same final response-time averages after repeating a trimming procedure in Excel. How could I trust my data, or myself? After all, you can’t record mouse clicks through different Excel menu options (unless you screen-capture your analysis session).

As R is a statistical programming language, you write scripts that will execute your analysis. So, you have a permanent record of your analysis steps, and will be able to reproduce your analysis exactly. More importantly, if you publish your script as supplementary material, ANYONE will be able to reproduce your analysis exactly. This is so important in today’s age of reproducible science. 
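To give a flavour of what such a script looks like, here is a small illustrative example; the file name, column names, and cut-offs are hypothetical.

    # Illustrative trimming script: every step is recorded and re-runnable.
    # The file name, column names, and cut-offs below are hypothetical.
    dat <- read.csv("experiment_data.csv")

    # Keep correct responses with response times between 200 ms and 2000 ms
    trimmed <- subset(dat, accuracy == 1 & rt > 200 & rt < 2000)

    # Mean response time per participant and condition
    averages <- aggregate(rt ~ subject + condition, data = trimmed, FUN = mean)

    write.csv(averages, "trimmed_averages.csv", row.names = FALSE)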

3. Packages. R has hundreds of packages, which are add-ons to the core R system that allow you to do specific tasks more easily. They are sets of commands that have been programmed by some R user to carry out certain functions more easily. For example, if you wish to use linear mixed-effects models (which are becoming more popular in psychology), you can download the lme4 package, which allows you to conduct this analysis. Want to do structural equation modelling? Download the sem package. There is a package for pretty much every statistical concept you can think of; importantly, all come fully documented with examples. Again, these are all FREE. You don’t need to buy an AMOS licence as you would with SPSS. Why limit yourself to a set of pre-defined analytical tools? Get R.
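Getting a package is a one-liner, and using it is not much more. Here is an illustrative lme4 snippet; the data are simulated purely for the example.

    install.packages("lme4")   # one-off download from CRAN, free
    library(lme4)

    # Tiny simulated data set, purely for illustration
    dat <- data.frame(subject   = factor(rep(1:20, each = 10)),
                      condition = factor(rep(c("A", "B"), times = 100)),
                      rt        = rnorm(200, mean = 500, sd = 50))

    # A linear mixed-effects model with a by-subject random intercept
    fit <- lmer(rt ~ condition + (1 | subject), data = dat)
    summary(fit)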

4. Programmable Functions. Can’t find a package that does the job you want? No problem! As R is primarily a programming language, you can just write a function yourself that will do the job. You can even contribute to R by publishing your own package containing your new functions, if you like.
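For instance, a home-made function for the standard error of the mean takes only a couple of lines:

    # A home-made function: standard error of the mean
    se <- function(x) {
      sd(x) / sqrt(length(x))
    }

    se(c(2, 4, 4, 4, 5, 5, 7, 9))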

5. Sexy Plots. R has some absolutely stunning plotting capabilities (for example, via the ggplot2 package—highly recommended!). These plots are of publishable quality and can be saved as vector graphics, so they will not lose resolution when your publisher scales them up. Check out some of these example plots using ggplot2 for just the tip of the iceberg of what R is capable of: http://docs.ggplot2.org/current/
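A minimal ggplot2 sketch, with data simulated purely for illustration:

    library(ggplot2)

    # Simulated response-time data, purely for illustration
    dat <- data.frame(condition = rep(c("congruent", "incongruent"), each = 100),
                      rt        = c(rnorm(100, 450, 50), rnorm(100, 520, 60)))

    ggplot(dat, aes(x = condition, y = rt)) +
      geom_boxplot() +
      labs(x = "Condition", y = "Response time (ms)") +
      theme_bw()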

6. Forces Deeper Engagement with Statistical Concepts. Because R isn’t a menu-driven point & click environment, you have to code your analysis as a script. For me, this forces you to become more conversant with the techniques you are using, lest a misunderstanding leads you to code your analysis incorrectly. Even just doing plain data trimming in R makes you feel more intimate with your data, because you are coding how the data are to be manipulated. It forces you to think about EVERYTHING you are doing. SPSS can be executed with your eyes closed and your brain off.

7. Computational Simulations. I do a lot of computer simulations, and R is an absolute god-send for this. Again, this is due to R being primarily a programming language. I have conducted simulations of human cognition, distribution of p-values under a null hypothesis, even annual rainfall in the UK! All in one environment. R is so versatile. If it involves numbers, and it can be programmed, R will do the job. 
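As a quick illustration, simulating the distribution of p-values when the null hypothesis is true takes just a couple of lines:

    # Quick sketch: p-values from 10,000 t-tests where the null is true
    p_values <- replicate(10000, t.test(rnorm(20), rnorm(20))$p.value)
    hist(p_values)   # should look roughly uniform between 0 and 1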

8. Data Scraping. R also has several packages that allow you to scrape data from the internet for analysis. This is great for data geeks like me, who like to explore government/sport/financial data sets just for fun. Using R, you can get the data, arrange it for analysis, conduct the analysis, and plot the results. All within the comfort of the R environment.
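A minimal sketch of reading an online data set straight into R (the URL below is a placeholder, not a real data source):

    # Minimal sketch: read a CSV straight from the web into a data frame.
    # The URL below is a placeholder, not a real data set.
    dat <- read.csv("http://example.com/some_open_data.csv")
    head(dat)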

9. Great Community. Most open-source ventures have great community spirit, but I find R has one of the best. Whenever you get stuck with how to do something (and you WILL get stuck), you can be almost certain to find an answer with a quick Google search, because someone in the community will have written up how to do what you need. If they haven’t, there are several resources for seeking help (such as Stack Exchange).

Disadvantages

1. Steep Learning Curve. R is quite difficult to learn. I remember when I first saw an R script I almost threw up in my mouth. It was so intimidating. R takes a while to get comfortable with. For months whilst learning R, I kept thinking how easily I could have done a certain analysis in SPSS, or how easy a plot would have been to create in Excel. But now that I’m getting to grips with it, I can honestly say it is all worth it, and that you should push through the pain barrier. At the end of it all, you will be left with a superior environment for your analytical needs.

2. Not Ideal for Undergraduates. Related to the above, it’s perhaps not best suited as an introductory software package for newbie-statisticians in psychology. This is because students in psychology often struggle with the statistical concepts themselves, so it would seem cruel to force them to learn a daunting programming language at the same time. (I have no data on this, and would love to hear from others who HAVE used it successfully at undergraduate level.) Also, at institutions where many staff teach on one module, you would have to ensure all of the staff are fluent in R before deploying it at undergraduate level. Everyone in psychology knows how to use SPSS, but this isn’t true for R. 

That being said, I think R is perfect for graduate level statistics. At this stage students should be comfortable with the basics of statistics so they can instead focus on learning to code. 

3. It Can be Slow. R isn’t a low-level language like C++, so executing a large set of analyses can take some time. Of course, this only really applies to LARGE sets of analyses. For example, at the moment I am running a computer simulation of performance on the flanker task. I am trying to find best-fitting parameters via a repeated search across many potential values. For each run of the model, I simulate 50,000 trials, arrange the synthetic data, compare them to the human data, find the discrepancy, and repeat until the discrepancy is minimal. This is repeated for EACH of the 30 participants in my data set. The whole simulation is due to take more than 3 weeks. But this is quite an unusual situation; standard analyses of the type you would do in SPSS are almost instantaneous. Still, it’s something to bear in mind.

(DISCLAIMER: The slow speed of my simulation could be due to my inefficient coding rather than R!)