
Fooled by Probability

Teaching statistics to psychology undergraduates is one of the most demanding (yet rewarding) aspects of my job. The job is likely made harder by the fact that humans have a very poor “intuitive” grasp of probability. In preparation for a class this coming semester, I revisited one of my favourite examples of just how bad we are at guessing probabilities (I first read about this example in the excellent book “Why do Buses Come in Threes?” by Rob Eastaway & Jeremy Wyndham).

Imagine you are teaching a class with 30 students in it. What is the probability that at least two of your students share the same birthday? Before you read on, have a think about what you would estimate this probability to be…


The Answer (and Proof!)

Would you be surprised to know that there is about a 71% chance (p[matching birthday] ≈ .71) that at least two of those 30 people share the same birthday? I was flabbergasted when I first heard this, as I had estimated the probability to be very low (as most people do). I had expected you would need about 180 people before you even had a .50 probability of finding two people with the same birthday, so with only 30 students it must surely be even lower.

The book states that, for any class size n, the probability that no two students share a birthday is:

p(\text{no match}) = 1 \times \left(1 - \frac{1}{365}\right) \times \left(1 - \frac{2}{365}\right) \times \cdots \times \left(1 - \frac{n - 1}{365}\right)

so the probability of at least one shared birthday is p(n) = 1 - p(\text{no match}).
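For a quick sanity check, the product can be evaluated directly in R (a minimal sketch; base R also ships pbirthday(), which computes the same quantity):

# exact probability of at least one shared birthday in a class of size n
pMatch <- function(n) {
  1 - prod(1 - (0:(n - 1)) / 365)
}

pMatch(30)     # ~0.706
pbirthday(30)  # base R's built-in version agrees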

Even with this proof in hand, I still couldn’t believe the result. So, I did what I always do when I don’t understand (or “get”) some mathematics I come across: I set up a simulation in R. 

Simulating the Problem

The simulation is straightforward. For each of 99 different class sizes (n = 2 to 100, in steps of 1), I populated the class with n random students, each with their own randomly-selected birthday. Each student's birthday was sampled uniformly (with replacement) from the 365 days in a year. Then, I examined whether any two students shared the same birthday. For each class size, I repeated this procedure 10,000 times, and plotted the results.
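As an aside, a single class size can be simulated in one line (a compact sketch using base R's replicate() and anyDuplicated(); the full, looped version appears at the end of the post):

# estimated p(matching birthday) for a class of 30, from 10,000 simulated classes
mean(replicate(10000, anyDuplicated(sample(365, 30, replace = TRUE)) > 0))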

The plots are found below. These plot the average percentage of matched birthdays against the size of the class being considered. I was interested in how sharply the percentage of matches rises as the class size increases. I was particularly interested in two scenarios: 1) When does the probability reach .5? 2) When does the probability reach 1?

Even with the mathematical proof above, I was still surprised at the outcome.

[Figure: percentage of matched birthdays plotted against class size, in two panels]

Both plots are the same, but the upper plot answers my first question, and the lower plot answers my second. It seems that you only need about 23 students (!!!) before you have a 50/50 chance of at least two students sharing the same birthday. This is WAY off my intuition that you would need about 180 students before you could even sniff at a 50% chance of a match.

The second plot shows that, to almost guarantee a matching birthday, you need about 74 students.

We really are fooled by probability! Poor students!

R Code for the Simulation

#------------------------------------------------------------------------------
### simulation parameters

# define vector of calendar dates
calendar <- seq(from = 1, to = 365, by = 1)

# how many simulations do you want per class size?
nSimulations <- 10000

# what is the size of each class? Here we have class sizes
# from 2 to 100
classSize <- seq(from = 2, to = 100, by = 1)
#------------------------------------------------------------------------------


#------------------------------------------------------------------------------
### start simulation here

# initiate vector to store data for each class size
classData <- numeric(length(classSize))

# initiate looping variable for class sizes
j <- 1

# loop over each class size
for(currSize in classSize){
  
  # initiate vector to store counts of replicated birthdays
  data <- numeric(nSimulations)
  
  # start the loop over number of simulated classes
  for(i in 1:nSimulations){
    
    # sample the students' birthdays
    dates <- sample(calendar, currSize, replace = TRUE)
    
    # count duplicated birthdays; flag this simulated class if there is
    # at least one match
    count <- sum(duplicated(dates))
    
    if(count > 0){
      data[i] <- 1
    }
      
  } # end of loop over simulated classes
  
  # calculate the percentage of simulated classes containing at least one
  # shared birthday
  classData[j] <- (sum(data) / nSimulations) * 100
  
  # update looping variable
  j <- j + 1
  
}
#------------------------------------------------------------------------------


#------------------------------------------------------------------------------
# now do some plotting
par(mfrow = c(2, 1))

plot(classSize, classData, ylab = "Percentage of Matched Birthdays", 
     xlab = "Class Size", pch = 2, type = "b")
segments(-2, 50, 22.8, 50, lty = 2, col = "red", lwd = 2)
segments(22.8, 50, 22.8, -2, lty = 2, col = "red", lwd = 2)
text(22, 5, pos = 4, "With ~23 students, p(matching birthday) = .5!", col = "red")

plot(classSize, classData, ylab = "Percentage of Matched Birthdays", 
     xlab = "Class Size", pch = 2, type = "b")
segments(-2, 100, 74, 100, lty = 2, col = "blue", lwd = 2)
segments(74, 100, 74, -2, lty = 2, col = "blue", lwd = 2)
text(74, 90, pos = 4, "With 74 students,\np(matching birthday) = 1!", col = "blue")
#------------------------------------------------------------------------------

Response Time Modelling – How Many Trials?

Over the past year I have become increasingly interested in models of simple decision-making processes, as measured by two-choice response time tasks. A simple 2-choice task might ask participants to judge whether a presented number is odd or even, for example. One popular model of 2-choice RT is the Ratcliff Diffusion Model. A simplified example of this model is presented below.

[Figure: simplified schematic of the diffusion model]

The idea behind the model is that, once presented, the stimulus elicits an evidence-accumulation process. That is, the cognitive system begins sampling the stimulus, and evidence accumulates in a noisy manner towards one of two response boundaries. Once a boundary has been reached, the corresponding response is executed. One boundary reflects the correct response, and the other reflects the incorrect response. The average rate of evidence accumulation (a drift diffusion process) is reflected in the drift rate parameter. The amount of evidence that must accumulate before a response is made is governed by the distance between the two boundaries (the boundary separation parameter). The starting point of the diffusion process is governed by the bias parameter; this is typically midway between the two boundaries because (for example) 50% of trials will present odd stimuli, so the participant has no reason to prefer one response over the other. The final parameter, non-decision time, reflects the time it takes to encode the stimulus and execute the motor response.

The model is very complex mathematically and computationally, but implementing it has been made easier in recent years by the release of several software packages. Among these are E-J Wagenmakers and colleagues' EZ diffusion model, Vandekerckhove & Tuerlinckx's DMAT toolbox for MatLab, and Voss and Voss' Fast-DM. For an excellent overview of the diffusion model, as well as available packages, see this paper by E-J Wagenmakers.
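To make the accumulation idea concrete, here is a minimal sketch (not part of any of these packages; parameter values are purely illustrative) that simulates a single noisy accumulation path between two boundaries using small Euler steps:

# simulate one evidence-accumulation path (illustrative values, not fitted)
set.seed(1)
dt    <- 0.001  # time step in seconds
drift <- 0.5    # drift rate
alpha <- 2      # boundary separation (boundaries at 0 and alpha)
tau   <- 0.3    # non-decision time

evidence <- alpha / 2  # unbiased starting point, midway between boundaries
path <- evidence
while(evidence > 0 && evidence < alpha){
  evidence <- evidence + drift * dt + rnorm(1, sd = sqrt(dt))
  path <- c(path, evidence)
}

rt <- tau + (length(path) - 1) * dt  # decision time plus non-decision time
plot(seq(0, by = dt, length.out = length(path)), path, type = "l",
     xlab = "Time (s)", ylab = "Evidence")
abline(h = c(0, alpha), lty = 2)  # response boundaries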

However, a new package (the “RWiener” package) has just been published by Wabersich and Vandekerckhove (see paper here) for R-Statistics. R is my favourite thing EVER (seriously), so I was thrilled to see an R package that can model response times.

How many trials?

The package is very straightforward; it can generate synthetic data with known parameters, as well as find best-fitting parameters for data you might have, which is superb. One question I was unsure of, though, is how accurate this parameter estimation routine is for varying numbers of trials in an experiment. That is, how many trials do I need, as a minimum, for the package to produce accurate parameter estimates? With human data this is a very difficult (impossible?) question to answer, but it is tractable when fitting the model to artificially-generated data: because we tell the computer which parameters to generate the data from, we can compare the estimated parameters to the “known” parameters and see how accurate the estimates are.
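In miniature, one generate-and-recover cycle looks like this (a sketch built from the same rwiener() and wiener_deviance() calls used in the full simulation at the end of the post):

library(RWiener)

# generate 250 trials from known parameters, then try to recover them
dat <- rwiener(n = 250, alpha = 2, tau = 0.3, beta = 0.5, delta = 0.5)
fit <- optim(c(1, 0.1, 0.1, 1), wiener_deviance, dat = dat, method = "Nelder-Mead")
fit$par  # compare to the known values: 2.0, 0.3, 0.5, 0.5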

I set up a simulation where I addressed the “how many trials?” question. The structure is straightforward:

  • Decide on some “known” parameters
  • Generate N trials with the known parameters for X simulated “experiments”
  • Fit the model to each simulated experiment
  • Compare estimated parameters to known parameters
  • Repeat for various values of N

For the simulations reported, I simulated experiments with Ns of 25, 50, 100, 250, 500, and 1000 trials. This spans from the ridiculously small (N=25) to the ridiculously large (N=1000). For each N, I conducted 100 simulated experiments. Then, I produced boxplots for each parameter, which allowed me to compare the estimated parameters to the known parameters (shown as a horizontal line on each plot). Ns which recover the known parameters well will have boxplots which cluster closely around the horizontal line.

[Figure: boxplots of estimated parameters by number of trials, one panel per parameter]

Conclusion

The fitting routine is able to produce reasonably accurate parameter estimates with as few as 250 trials; beyond this, there is not much advantage to increasing trial numbers. Note that 100 trials is still OK, but more trials produce better estimates. There is no systematic bias in these estimates at low trial numbers except for the non-decision time parameter: its estimates tended to be too high with lower trial numbers, and this bias decreased as trial numbers were boosted.

The RWiener package is a welcome addition for fitting models to response time data, producing reasonably accurate parameter estimates with as few as ~250 trials.

 

R Code for the Above Simulation

The code for the simulation is below. (Note: this takes quite a while to run, as my code is likely rather inelegant.)

rm(list = ls())

library(RWiener)

#how many trials in each sample condition?
nTrials <- c(25, 50, 100, 250, 500, 1000)
#how many simulations for each sample?
nSims <- 100

#known parameters
parms <- c(2.00, #boundary separation
           0.30, #non-decision time
           0.50, #initial bias
           0.50) #drift rate

#what are the starting parameters for the parameter minimisation routine?
startParms <- c(1, 0.1, 0.1, 1) #boundary, non-decision, initial bias, & drift

#start simulations here
for(i in 1:length(nTrials)){ #for each number of trials (25, 50, etc...)

  #get the current number of trials to generate
  tempTrials <- nTrials[i] 

  #generate empty matrix to store results
  tempMatrix <- matrix(0, nrow=nSims, ncol=length(parms))

  for(j in 1:nSims){ #do a loop over total number of simulations required

    #generate synthetic data
    tempData <- rwiener(n=tempTrials, alpha=parms[1], tau=parms[2], beta=parms[3],
                        delta=parms[4])
    #now fit the model to this synthetic data
    findParms <- optim(startParms, wiener_deviance, dat=tempData, method="Nelder-Mead")

    #store the best-fitting parameters
    tempMatrix[j, ] <- findParms$par
  }

  #store the matrix of results as a unique matrix
  assign(paste("data", tempTrials, sep = ""), tempMatrix)
}

#do the plotting

par(mfrow=c(2,2)) #change layout to 2x2

#plot boundary
boxplot(data25[, 1], data50[, 1], data100[, 1], data250[, 1], data500[, 1], data1000[, 1],
        names = nTrials, xlab = "Number of Trials", ylab = "Parameter estimate", main = "Boundary Separation")
abline(a=parms[1], b=0)

#plot non-decision time
boxplot(data25[, 2], data50[, 2], data100[, 2], data250[, 2], data500[, 2], data1000[, 2],
        names = nTrials, xlab = "Number of Trials", ylab = "Parameter estimate", main = "Non-Decision Time")
abline(a=parms[2], b=0)

#plot initial bias
boxplot(data25[, 3], data50[, 3], data100[, 3], data250[, 3], data500[, 3], data1000[, 3],
        names = nTrials, xlab = "Number of Trials", ylab = "Parameter estimate", main = "Initial Bias")
abline(a=parms[3], b=0)

#plot drift rate
boxplot(data25[, 4], data50[, 4], data100[, 4], data250[, 4], data500[, 4], data1000[, 4],
        names = nTrials, xlab = "Number of Trials", ylab = "Parameter estimate", main = "Drift Rate")
abline(a=parms[4], b=0)