
Animating Robustness-Check of Bayes Factor Priors

Today I submitted a paper which contained some Bayesian analysis using Bayes factors (a default Bayesian t-test). As the test used a default prior on the effect size (a Cauchy distribution centred on zero with rate r = 0.707), I wanted to pre-empt any reviewer concern that the final Bayes factor reported was a peculiarity of the prior rather than driven by the data.

So, I performed a standard robustness check on the prior, where I plot the resulting Bayes factor across a wide range of prior values (in this case, a wide range of Cauchy rate parameters). The plot below is the result:

[Figure: Bayes factor robustness check across Cauchy prior widths]

The plot shows the Bayes factor with the default prior as the black dot; the line represents how the Bayes factor changes as a function of the Cauchy prior width. This analysis suggests the final conclusion is not a peculiarity of the prior, and as such the reported Bayes factor is quite robust.

What this plot doesn’t show, though, is what the prior distribution looks like as the rate (r) changes. Ideally, I wanted to show what the Cauchy distribution looks like across the entire range of r values explored in the plot above; this helps me (and I guess other people) understand how the prior is influencing the Bayes factor. Visualisation is very important sometimes.

So, I knocked up some R code which plots the robustness check above, and next to it the density of the Cauchy distribution with the rate parameter currently being explored. The animation cycles through increasing values of the rate r, showing the density for this rate on the left and the Bayes factor for my data under this prior on the right. For me, this animation provides a nice insight into how the Bayes factor is being influenced by the prior.

[Animation: the Cauchy prior density (left) and the corresponding Bayes factor on the robustness curve (right) as the rate r increases]

R Code

The code for generating the individual plots is below. Note that all of these plots then need to be fed into your favourite gif-making device! (I used http://gifmaker.me/).
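This listing is a minimal sketch of the approach rather than my exact script: it assumes a one-sample design, synthetic data standing in for mine, and the BayesFactor package's ttestBF() with its rscale argument.

# sketch: for each Cauchy rate value, draw the prior density next to the
# Bayes factor robustness curve, and save the frame as a PNG for the gif maker
library(BayesFactor)

set.seed(42)
dat <- rnorm(50, mean = 0.3, sd = 1) # hypothetical data standing in for mine

# range of Cauchy rate (r) values to explore
rates <- seq(0.1, 1.5, by = 0.05)

# Bayes factor for each prior width
bf <- sapply(rates, function(r) extractBF(ttestBF(dat, rscale = r))$bf)

for(i in seq_along(rates)){
  r <- rates[i]
  png(sprintf("frame_%03d.png", i), width = 800, height = 400)
  par(mfrow = c(1, 2))
  
  # left panel: the Cauchy prior with the current rate
  curve(dcauchy(x, location = 0, scale = r), from = -4, to = 4,
        xlab = "Effect size", ylab = "Density",
        main = sprintf("Cauchy prior, r = %.2f", r))
  
  # right panel: the robustness curve, with the current rate highlighted
  plot(rates, bf, type = "l", log = "y",
       xlab = "Cauchy rate (r)", ylab = "Bayes factor (BF10)",
       main = "Robustness check")
  points(r, bf[i], pch = 19, col = "red")
  
  dev.off()
}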


Fooled by Probability

Teaching statistics to psychology undergraduates is one of the most demanding (yet rewarding) aspects of my job. The job is made harder by the fact that humans have a very poor "intuitive" grasp of probability. In preparation for a class this coming semester, I revisited one of my favourite examples of just how bad we are at guessing probabilities (I first read about this example in the excellent book "Why do Buses Come in Threes?" by Rob Eastaway & Jeremy Wyndham).

Imagine you are teaching a class with 30 students in it. What is the probability that at least two of your students share the same birthday? Before you read on, have a think what you would estimate this probability to be…


The Answer (and Proof!)

Would you be surprised to know that there is about a 71% chance (p[matching birthday] = ~.71) that at least two people share the same birthday? I was flabbergasted when I first heard this, as I had estimated the probability to be very low (as most people do). I had expected you would need about 180 people before you even had a .50 probability of finding two people with the same birthday, so with only 30 students it must surely be even less.

The book shows that, for a class of size n, the probability that no two students share a birthday is

q(n) = 1 \times \left(1 - \frac{1}{365}\right) \times \left(1 - \frac{2}{365}\right) \times \cdots \times \left(1 - \frac{n - 1}{365}\right)

so the probability of at least one shared birthday is p(n) = 1 - q(n).
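In R, this product is a one-liner, which makes it easy to check the claim for any class size (a quick helper of my own, not from the book):

# exact probability that at least two people in a class of size n share a birthday
pMatch <- function(n) 1 - prod(1 - (0:(n - 1)) / 365)

pMatch(23) # ~.51
pMatch(30) # ~.71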

Even with this proof in hand, I still couldn’t believe the result. So, I did what I always do when I don’t understand (or “get”) some mathematics I come across: I set up a simulation in R. 

Simulating the Problem

The simulation is straightforward. For each of 99 class sizes (n = 2 to 100, in steps of 1), I populated the class with n simulated students, each with a randomly-selected birthday drawn (with replacement, and with equal probability) from the 365 days of the year. I then examined whether any two students shared the same birthday. For each class size, I repeated this procedure 10,000 times and plotted the results.

The plots are below. They show the percentage of simulated classes containing at least one matched birthday as a function of class size. I was interested in how sharply the percentage of matches rises as the class size increases, and in two questions in particular: 1) When does the probability reach .5? 2) When does it get close to 1?

Even with the mathematical proof above, I was still surprised at the outcome.

[Figure: two panels plotting the percentage of simulated classes with a matching birthday against class size]

Both plots show the same data, but the upper plot answers my first question and the lower plot answers my second. It seems you only need about 23 students (!!!) before you have a 50/50 chance of at least two students sharing a birthday. This is WAY off my intuition that you would need about 180 students before you could even sniff at a 50% chance of a match.

The second plot shows that, to almost guarantee there being a match of birthdays, you need about 74 students. 

We really are fooled by probability! Poor students!

 R Code for the Simulation

#------------------------------------------------------------------------------
### simulation parameters

# define vector of calendar dates
calendar <- seq(from = 1, to = 365, by = 1)

# how many simulations do you want per class size?
nSimulations <- 10000

# what is the size of each class? Here we have class sizes
# from 2 to 100
classSize <- seq(from = 2, to = 100, by = 1)
#------------------------------------------------------------------------------


#------------------------------------------------------------------------------
### start simulation here

# initiate vector to store data for each class size
classData <- numeric(length(classSize))

# initiate looping variable for class sizes
j <- 1

# loop over each class size
for(currSize in classSize){
  
  # initiate vector to store counts of replicated birthdays
  data <- numeric(nSimulations)
  
  # start the loop over number of simulated classes
  for(i in 1:nSimulations){
    
    # sample the students' birthdays
    dates <- sample(calendar, currSize, replace = TRUE)
    
    # count how many duplicated birthdays occurred; if there is at least
    # one, record a match for this simulated class
    count <- sum(duplicated(dates))
    
    if(count > 0){
      data[i] <- 1
    }
    
  } # end of simulations for the current class size
  
  # calculate the percentage of replicated birthdays across all simulated 
  # runs
  classData[j] <- (sum(data) / nSimulations) * 100
  
  # update looping variable
  j <- j + 1
  
}
#------------------------------------------------------------------------------


#------------------------------------------------------------------------------
# now do some plotting
par(mfrow = c(2, 1))

plot(classSize, classData, ylab = "Percentage of Matched Birthdays", 
     xlab = "Class Size", pch = 2, type = "b")
segments(-2, 50, 22.8, 50, lty = 2, col = "red", lwd = 2)
segments(22.8, 50, 22.8, -2, lty = 2, col = "red", lwd = 2)
text(22, 5, pos = 4, "With ~23 students, p(matching birthday) = .5!", col = "red")

plot(classSize, classData, ylab = "Percentage of Matched Birthdays", 
     xlab = "Class Size", pch = 2, type = "b")
segments(-2, 100, 74, 100, lty = 2, col = "blue", lwd = 2)
segments(74, 100, 74, -2, lty = 2, col = "blue", lwd = 2)
text(74, 90, pos = 4, "With 74 students,\np(matching birthday) = 1!", col = "blue")
#------------------------------------------------------------------------------
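As an aside, base R also has an analytical helper for this problem, pbirthday(), which gives the probability of at least one shared birthday for a given group size and makes a handy cross-check on the simulation:

# analytical cross-check on the simulated percentages
pbirthday(23) # ~.51
pbirthday(30) # ~.71
pbirthday(74) # very nearly 1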

Should I Switch to R?

R is a programming language primarily geared for statistical computing. Within psychology, it is fast becoming SPSS’s main competitor when it comes to conducting analysis. I have been using R for most of my work over the past 18 months, and I absolutely love it; it was not an exaggeration when I said in a previous post that R is my favourite thing EVER (OK, it’s top among work-related things, at least). I am a true R-convert, and I preach its existence to anyone who will listen. Halleluj-R!

This week on Twitter, someone asked others to list advantages of using R over SPSS in a teaching situation. Although I don’t use R for teaching (more on this below), it forced me to reflect on WHY I love R so much. So, I decided to list some of the core advantages I see R as having (in no particular order). In the spirit of fairness, I also reflected on some key disadvantages of using R.

I hope others find this of use before deciding whether to plunge into the R-world. I say dive right in.

Advantages

1. It’s Free. R is an open-source venture, so EVERYTHING you need in R is free. Yup; FREE. To me, this is so important, because it allows the skills you develop whilst learning R to travel with you regardless of where your next job is. Imagine only knowing SPSS, but then moving to an institution that doesn't have an SPSS licence. What do you do? Are you going to fork out for the individual licence fee yourself (which, by the way, will expire after a measly 12 months)?

2. Reproducible Analysis. It is very important that you be able to reproduce your analysis EXACTLY. It is embarrassing how many times in the past I have failed to reproduce the same final response time averages after repeating a trimming procedure in Excel. How could I trust my data, or myself? After all, you can't record your mouse clicks through different Excel menu options (unless you screen-capture your analysis session).

As R is a statistical programming language, you write scripts that will execute your analysis. So, you have a permanent record of your analysis steps, and will be able to reproduce your analysis exactly. More importantly, if you publish your script as supplementary material, ANYONE will be able to reproduce your analysis exactly. This is so important in today’s age of reproducible science. 

3. Packages. R has hundreds of packages, which are add-ons to the core R system that allow you to do specific tasks more easily. They are a set of commands that have been programmed by some R user to execute certain functions more easily. For example, if you wish to use linear mixed effects models (which are becoming more popular in psychology), you can download the lme4 package, which allows you to conduct this analysis. Want to do structural equation modelling? Download the SEM package. There is a package for pretty much every statistical concept you can think of; importantly, all come fully-documented with examples. Again, these are all FREE. You don’t need to buy an AMOS licence as you would in SPSS. Why limit yourself to a set of pre-defined analytical tools? Get R.  
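To give a flavour, fitting a mixed effects model takes only a couple of lines once lme4 is installed (a sketch using lme4's bundled sleepstudy data, not an analysis from my own work):

# one-off download from CRAN, then load the package
# install.packages("lme4")
library(lme4)

# random intercepts and slopes for each subject
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fit)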

4. Programmable Functions. Can't find a package that does the job you want? No problem! As R is primarily a programming language, you can just write a function yourself that will do the job. You can even contribute to R by publishing your own package containing your new functions if you like.
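For example (a trivial helper of my own), base R has no built-in standard error function, so you can simply write one:

# standard error of the mean
se <- function(x, na.rm = FALSE){
  if(na.rm) x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

se(rnorm(100))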

5. Sexy Plots. R has some absolutely stunning plotting capabilities (for example, via the ggplot2 package—highly recommended!). These plots are of publishable quality and can be exported as vector graphics, so they will not lose resolution when your publisher scales your plots up. Check out some of these example plots using ggplot2 for just the tip of the iceberg as to what R is capable of: http://docs.ggplot2.org/current/
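As a small taste (my own example using ggplot2's bundled mpg data, not a plot from a paper):

library(ggplot2)

# scatterplot with per-class colouring and a linear trend line
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon")

# saved as a vector PDF, so it scales without losing resolution
ggsave("mpg_plot.pdf", p, width = 6, height = 4)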

6. Forces Deeper Engagement with Statistical Concepts. Because R isn't a menu-driven point & click environment, you have to code your analysis in a script. For me, this forces you to become more conversant with the techniques you are using, lest a misunderstanding leads you to code your analysis incorrectly. Even just doing plain data trimming in R makes you feel more intimate with your data, because you are coding how the data are to be manipulated. It forces you to think about EVERYTHING you are doing. SPSS can be executed with your eyes closed and your brain off.
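To show what I mean by coding your trimming (a hypothetical example; the data frame and cut-offs are made up):

# hypothetical response time data: subject, RT in ms, accuracy (0/1)
rtData <- data.frame(subject  = rep(1:10, each = 100),
                     rt       = rexp(1000, rate = 1/500) + 200,
                     accuracy = rbinom(1000, 1, 0.95))

# keep correct trials with RTs between 200 ms and 2000 ms; every decision
# is written down, so the trimming can be re-run and checked
trimmed <- subset(rtData, accuracy == 1 & rt > 200 & rt < 2000)

# mean RT per subject after trimming
aggregate(rt ~ subject, data = trimmed, FUN = mean)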

7. Computational Simulations. I do a lot of computer simulations, and R is an absolute god-send for this. Again, this is due to R being primarily a programming language. I have conducted simulations of human cognition, distribution of p-values under a null hypothesis, even annual rainfall in the UK! All in one environment. R is so versatile. If it involves numbers, and it can be programmed, R will do the job. 
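As one flavour of this, simulating the distribution of p-values under the null hypothesis takes only a few lines (a sketch along those lines, not my original simulations):

set.seed(1)

# 10,000 two-sample t-tests in which the null hypothesis is true
pvals <- replicate(10000, t.test(rnorm(20), rnorm(20))$p.value)

# under the null, p-values are uniformly distributed
hist(pvals, breaks = 20, main = "p-values under the null", xlab = "p")
mean(pvals < .05) # should be close to .05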

8. Data Scraping. R also has several packages that allow you to scrape data from the internet for analysis. This is great for data geeks like me, who like to explore government/sport/financial data sets just for fun. Using R, you can get the data, arrange it into a suitable format, conduct the analysis, and plot the results, all within the comfort of the R environment.
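For example (a sketch assuming the rvest package, with a Wikipedia page used purely as an illustration):

library(rvest)

# read a web page and pull out any HTML tables it contains
page   <- read_html("https://en.wikipedia.org/wiki/World_population")
tables <- html_table(page)

# the first table can then be cleaned and analysed like any data frame
head(tables[[1]])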

9. Great Community. Most open source ventures have great community spirit, but I find R has one of the best. Whenever you get stuck with how to do something (and you WILL get stuck), you can be almost certain to find an answer with a quick Google search, because someone in the community will have written about how to do what you need. If they haven't, there are several resources for seeking help (such as StackExchange).

Disadvantages

1. Steep Learning Curve. R is quite difficult to learn. I remember when I first saw an R script I almost threw up in my mouth. It was so intimidating. R takes a while to get comfortable with. For months whilst learning R I kept thinking how easily I could do a certain analysis in SPSS, or how easy a plot would be to create in Excel. But now that I'm getting to grips with it, I can honestly say it is all worth it, and that you should push through the pain barrier. At the end of it all, you will be left with a superior environment for your analytical needs.

2. Not Ideal for Undergraduates. Related to the above, it’s perhaps not best suited as an introductory software package for newbie-statisticians in psychology. This is because students in psychology often struggle with the statistical concepts themselves, so it would seem cruel to force them to learn a daunting programming language at the same time. (I have no data on this, and would love to hear from others who HAVE used it successfully at undergraduate level.) Also, at institutions where many staff teach on one module, you would have to ensure all of the staff are fluent in R before deploying it at undergraduate level. Everyone in psychology knows how to use SPSS, but this isn’t true for R. 

That being said, I think R is perfect for graduate level statistics. At this stage students should be comfortable with the basics of statistics so they can instead focus on learning to code. 

3. It Can be Slow. R isn't a low-level language like C++, so executing a large set of analyses can take some time. Of course, this only really applies to LARGE analyses. For example, at the moment I am running a computer simulation of performance on the Flanker task. I am trying to find best-fitting parameters by a repeated search across many potential values. For each run of the model, I simulate 50,000 trials, arrange the synthetic data, compare it to human data, find the discrepancy, and repeat until the discrepancy is minimal. This is repeated for EACH of the 30 participants in my data set. The whole simulation is due to take more than 3 weeks. But this is quite an unusual situation; standard analyses of the type you would do in SPSS are almost instantaneous. Still, it's something to bear in mind.

(DISCLAIMER: The slow speed of my simulation could be due to my inefficient coding rather than R!) 

Saving Multiple Plots to PDF in R

Sometimes when doing data analysis I need to create multiple plots. In R, you can save each plot to a separate file, but this leads to the problem of how to view each plot rapidly (for example if you wish to compare several plots in quick succession). 

In the video below, I show you how to get around this problem by saving each plot to a separate page within ONE PDF file. This makes it easy to view each plot in quick succession. Indeed, in the video, I show you how this can lead to a form of animation, showing how data changes with each plot.
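The video isn't reproduced here, but the core idea is that base R's pdf() device sends each successive plot to a new page of the same file. A minimal sketch:

# open one PDF device; every subsequent plot becomes a new page
pdf("all_plots.pdf", width = 7, height = 5)

# example: a normal density whose mean shifts from plot to plot
for(mu in seq(-2, 2, by = 0.5)){
  curve(dnorm(x, mean = mu), from = -5, to = 5,
        ylab = "Density",
        main = paste("Normal density, mean =", mu))
}

# close the device; flipping through the pages gives a crude animation
dev.off()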

Friedman’s test with tied data

I completely sympathise with students sometimes when they say they hate statistics. I really do. A colleague emailed me the other day in response to a student query regarding difficulty in getting their hand-calculation for a Friedman test to match that produced by SPSS (shout-out to this eagle-eyed student!!). I smiled to myself, thinking “silly SPSS, giving the wrong results again!”. Then, in a flurry of smugness, I fired up R, entered the data my colleague had provided, and found to my utter surprise that R, too, was producing the “incorrect” result. I checked the hand calculations in Excel and they were all fine. So how come R and SPSS were giving different (apparently incorrect) results?

Here are the data (including the ranks):

[Table: the ranked data for 12 participants across the three conditions (None, Classical, Dance); the same ranks appear in the rankData matrix in the code below]

The equation used in the hand-calculations was

\chi_F^2 = \frac{12}{n k (k + 1)} \sum_{i=1}^{k} R_i^2 - 3 n (k + 1)

where n is the number of observations per condition (12), k is the number of conditions (3), and R_i is the sum of ranks in the ith column (condition). It's a gnarly-looking equation, and students gulp when it's presented.

However, it turns out this equation is only correct if there are no ties in your data. My colleague informed me this was not mentioned in the book he had access to, and I have since checked 3 books that I have at home (haven’t checked the ones in my office, yet), and all of them give the above formula and make no mention of a correction-for-ties.

The interweb didn’t help much, either, in trying to work out what was going wrong, but I stumbled across a book chapter on Google books (link is here) which gives a corrected formula. To help others, it’s repeated below:

[Image: the tie-corrected formula given in the linked chapter]

Wikipedia to the rescue (!)

It turns out that Wikipedia has an excellent page on Friedman's test which provides yet another (set of) formula(s). I'm not sure of the original source Wikipedia took this equation from, but it turns out to be a general formula that works for tied and untied data alike. So, it makes sense that R might be using it (that would be more economical than continually testing whether to use the first or the second equation reported above).
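For reference, the general form (as implemented in the code at the end of this post) first computes the mean rank for each condition and the grand mean rank,

\bar{r}_{j} = \frac{1}{n}\sum_{i=1}^{n} r_{ij}, \qquad \bar{r} = \frac{1}{nk}\sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij}

then a treatment and an error sum of squares,

SS_t = n \sum_{j=1}^{k} \left(\bar{r}_{j} - \bar{r}\right)^2, \qquad SS_e = \frac{1}{n(k-1)} \sum_{i=1}^{n}\sum_{j=1}^{k} \left(r_{ij} - \bar{r}\right)^2

and finally the test statistic

Q = \frac{SS_t}{SS_e}

where r_{ij} is the rank of condition j within participant i. Q is compared against a chi-square distribution with k - 1 degrees of freedom.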

What have I learned?

  • Students are very observant. Had this student not contacted my colleague (and had he not subsequently contacted me), this error would have crept into next year, too. Genuine thanks to them for coming forward and questioning my lecture slides. 
  • Statistics is hard. How can we expect students to “get” all of this when I didn’t even know there was a mistake here?
  • Most textbooks discussing this test don’t use tied data, and so use the first equation above. This does not work with tied data. You can use the second equation, but best use the generalised equation on Wikipedia.
  • Wikipedia is (sometimes, at least!) accurate!
  • R is still ace.

For those interested, here is the R code for computing the Friedman statistic using the equations from Wikipedia. This is just to show what the test is doing; see the R help page (?friedman.test) for the built-in shortcut.

rankData <-
  matrix(c(2.5, 2.5, 1,
           3, 1.5, 1.5,
           1.5, 3, 1.5,
           2, 3, 1,
           1.5, 3, 1.5,
           3, 1.5, 1.5,
           2, 2, 2,
           3, 1.5, 1.5,
           2.5, 1, 2.5,
           3, 1.5, 1.5,
           2, 3, 1,
           2.5, 2.5, 1),
         nrow = 12,
         byrow = TRUE,
         dimnames = list(1:12, c("None", "Classical", "Dance")))

# mean rank for each condition (n = 12 participants)
rbar_none      <- (1/12) * sum(rankData[, 1])
rbar_classical <- (1/12) * sum(rankData[, 2])
rbar_dance     <- (1/12) * sum(rankData[, 3])

# grand mean rank across all n * k cells
rbar <- (1/(3*12)) * sum(rankData)

# treatment and error sums of squares (the general formulation)
ssT <- 12 * sum(((rbar_none - rbar)^2), ((rbar_classical - rbar)^2), ((rbar_dance - rbar)^2))
ssE <- (1/(12*2)) * sum(((rankData - rbar)^2))

# test statistic; compare against a chi-square distribution with k - 1 = 2 df
Q <- ssT/ssE
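For comparison, the built-in routine can be run directly on the same matrix (it re-ranks within rows, which leaves these midranks unchanged), and it should reproduce Q:

# rows are participants (blocks), columns are conditions
friedman.test(rankData)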

Response Time Modelling – How Many Trials?

Over the past year I have become increasingly interested in using models of simple decision making processes, as measured by two-choice response time tasks. A simple 2-choice task might ask participants to judge whether a presented number is odd or even, for example. One popular model of 2-choice RT is the Ratcliff Diffusion Model. A simplified example of this model is presented below.

[Figure: a simplified schematic of the drift diffusion model]

The idea behind the model is that, once presented, the stimulus elicits an evidence-accumulation process: the cognitive system begins sampling the stimulus, and evidence accumulates in a noisy manner towards one of two response boundaries. Once a boundary has been reached, that response is executed. One boundary reflects the correct response, and the other the incorrect response. The average rate of evidence accumulation (a drift diffusion process) is reflected in the drift rate parameter. The amount of evidence that must accumulate before a response is made is governed by the distance between the two boundaries (the boundary separation parameter). The starting point of the accumulation process is reflected in a bias parameter; it typically sits midway between the two boundaries, because (for example) 50% of trials present odd stimuli, so the participant has no reason to prefer one response over the other. The final parameter, non-decision time, reflects the time it takes to encode the stimulus and execute the motor response.

The model is complex mathematically and computationally, but implementing it has been made easier in recent years by the release of several software packages. Among these are E-J Wagenmakers and colleagues' EZ diffusion model, Vandekerckhove & Tuerlinckx's DMAT toolbox for MatLab, and Voss and Voss' Fast-DM. For an excellent overview of the diffusion model, as well as the available packages, see this paper by E-J Wagenmakers.

However, a new package (the “RWiener” package) has just been published by Wabersich and Vandekerckhove (see paper here) for R-Statistics. R is my favourite thing EVER (seriously), so I was thrilled to see an R package that can model response times.

How many trials?

The package is very straightforward; it can generate synthetic data with known parameters as well as find best-fitting parameters for data you already have, which is superb. One question I was unsure of, though, is how accurate the parameter estimation routine is for varying numbers of trials. That is, how many trials do I need, as a minimum, in my experiment for this package to produce accurate parameter estimates? With human data this is a very difficult (impossible?) question to answer, but it is tractable when fitting the model to artificially-generated data: because we know which parameters gave rise to the data (we tell the computer which to use), we can check whether the estimated parameters are accurate by comparing them to the “known” parameters.

I set up a simulation where I addressed the “how many trials?” question. The structure is straightforward:

  • Decide on some “known” parameters
  • Generate N trials with the known parameters for X simulated “experiments”
  • Fit the model to each simulated experiment
  • Compare estimated parameters to known parameters
  • Repeat for various values of N

For the simulations reported here, I simulated experiments with Ns of 25, 50, 100, 250, 500, and 1000 trials. This spans the ridiculously small (N = 25) to the ridiculously large (N = 1000). For each N, I conducted 100 simulated experiments. I then produced boxplots for each parameter, which allow me to compare the estimated parameters to the known parameters (shown as a horizontal line on each plot). Ns which recover the known parameters well will have boxplots that cluster tightly around the horizontal line for each parameter.

[Figure: boxplots of the estimated boundary separation, non-decision time, initial bias, and drift rate for each trial number, with horizontal lines marking the known parameter values]

Conclusion

The fitting routine produces reasonably accurate parameter estimates with as few as 250 trials; beyond this, there is not much advantage to increasing trial numbers. Note that 100 trials is still OK, but more trials produce better estimates. There is no systematic bias in the estimates at low trial numbers except for the non-decision time parameter: its estimates tended to be too high with few trials, and this bias shrank as trial numbers increased.

The RWiener package is a welcome addition for fitting the diffusion model to response time data, producing reasonably accurate parameter estimates with as few as ~250 trials.

 

R Code for above simulation

The code for the simulation is below. (Note: this takes quite a while to run, as my code is likely very inelegant.)

rm(list = ls())

library(RWiener)

#how many trials in each sample condition?
nTrials <- c(25, 50, 100, 250, 500, 1000)
#how many simulations for each sample?
nSims <- 100

#known parameters
parms <- c(2.00, #boundary separation
           0.30, #non-decision time
           0.50, #initial bias
           0.50) #drift rate

#what are the starting parameters for the parameter minimisation routine?
startParms <- c(1, 0.1, 0.1, 1) #boundary, non-decision, initial bias, & drift

#start simulations here
for(i in 1:length(nTrials)){ #for each number of trials (25, 50, etc...)

  #get the current number of trials to generate
  tempTrials <- nTrials[i] 

  #generate empty matrix to store results
  tempMatrix <- matrix(0, nrow=nSims, ncol=length(parms))

  for(j in 1:nSims){ #do a loop over total number of simulations required

    #generate synthetic data
    tempData <- rwiener(n=tempTrials, alpha=parms[1], tau=parms[2], beta=parms[3],
                        delta=parms[4])
    #now fit the model to this synthetic data
    findParms <- optim(startParms, wiener_deviance, dat=tempData, method="Nelder-Mead")

    #store the best-fitting parameters
    tempMatrix[j, ] <- findParms$par
  }

  #store the matrix of results as a unique matrix
  assign(paste("data", tempTrials, sep = ""), tempMatrix)
}

#do the plotting

par(mfrow=c(2,2)) #change layout to 2x2

#plot boundary
boxplot(data25[, 1], data50[, 1], data100[, 1], data250[, 1], data500[, 1], data1000[, 1],
        names = nTrials, xlab = "Number of Trials", ylab = "Parameter estimate", main = "Boundary Separation")
  abline(a=parms[1], b=0)

#plot non-decision time
boxplot(data25[, 2], data50[, 2], data100[, 2], data250[, 2], data500[, 2], data1000[, 2],
        names = nTrials, xlab = "Number of Trials", ylab = "Parameter estimate", main = "Non-Decision Time")
abline(a=parms[2], b=0)

#plot initial bias
boxplot(data25[, 3], data50[, 3], data100[, 3], data250[, 3], data500[, 3], data1000[, 3],
        names = nTrials, xlab = "Number of Trials", ylab = "Parameter estimate", main = "Initial Bias")
abline(a=parms[3], b=0)

#plot drift rate
boxplot(data25[, 4], data50[, 4], data100[, 4], data250[, 4], data500[, 4], data1000[, 4],
        names = nTrials, xlab = "Number of Trials", ylab = "Parameter estimate", main = "Drift Rate")
abline(a=parms[4], b=0)