# Fooled by Probability

Teaching statistics to psychology undergraduate is one of the most demanding (yet rewarding) aspects of my job. This job is likely made harder by the fact that humans have a very poor “intuitive” grasp of probability. In preparation for a class this coming semester, I revisited one of my favourite examples of just how bad we are at guessing probabilities (I had first read about this example in the excellent book “Why do Buses Come in Threes?”, by Rob Eastaway & Jeremy Wyndham).

Imagine you are teaching a class with 30 students in it. What is the probability that at least two of your students share the same birthday? Before you read on, have a think what you would estimate this probability to be…

Would you be surprised to know that you have about a 78%  chance (p[matching birthday] = ~.78) that at least two people share the same birthday? I was flabbergasted when I first heard this, as I had estimated the probability to be very low (as do most people). I had expected you need about 180 people before you even had a .50 probability of finding people with the same birthday, so with only 30 students it must be even less.

The book states that the solution for any class size n can be calculated by:

$p(n) = 1 * \left(1 - \frac{1}{365}\right) * \left(1 - \frac{2}{365}\right) * .... * \left(1 - \frac{n - 1}{365}\right)$

Even with this proof in hand, I still couldn’t believe the result. So, I did what I always do when I don’t understand (or “get”) some mathematics I come across: I set up a simulation in R.

## Simulating the Problem

The simulation is straightforward. For each of 99 different class sizes (n = 2 to 100, in steps of 1), I populated the class with n number of random students, each with their own randomly-selected birthday. Each birthday had an equal probability of being assigned to each student (with replacement) from the 365 days in a year. Then, I examined whether any two students shared the same birthday. For each class size, I repeated this procedure 10,000 times, and plotted the results.

The plots are found below. These plot the  average percentage of matched birthdays against the size of the class being considered. I was interested in how sharply the percentage of matches rises as the class size increases. I was particularly interested in two scenarios: 1) When does the probability equal .5? 2) When does the probability equal 1?

Even with the mathematical proof above, I was still surprised at the outcome.

Both plots are the same, but the upper plot answers my first question, and the latter plot answers my second one. It seems that you only need about 23 students (!!!) before you have a 50/50 chance of at least two students sharing the same birthday. This is WAY off my intuition of there needing to be about 180 students before you could even sniff at a 50% chance of a match.

The second plot shows that, to almost guarantee there being a match of birthdays, you need about 74 students.

We really are fooled by probability! Poor students!

## R Code for the Simulation

#------------------------------------------------------------------------------
### simulation parameters

# define vector of calendar dates
calendar <- seq(from = 1, to = 365, by = 1)

# how many simulations do you want per class size?
nSimulations <- 10000

# what is the size of each class? Here we have class sizes
# from 2 to 100
classSize <- seq(from = 2, to = 100, by = 1)
#------------------------------------------------------------------------------

#------------------------------------------------------------------------------
### start simulation here

# initiate vector to store data for each class size
classData <- numeric(length(classSize))

# initiate looping variable for class sizes
j <- 1

# loop over each class size
for(currSize in classSize){

# initiate vector to store counts of replicated birthdays
data <- numeric(nSimulations)

# start the loop over number of simulated classes
for(i in 1:nSimulations){

# sample the students' birthdayes
dates <- sample(calendar, currSize, replace = TRUE)

# count how many replications, and add to the data vector
count <- sum(duplicated(dates))

if(count > 0){
data[i] <- 1
}

} # end of current class size

# calculate the percentage of replicated birthdays across all simulated
# runs
classData[j] <- (sum(data) / nSimulations) * 100

# update looping variable
j <- j + 1

}
#------------------------------------------------------------------------------

#------------------------------------------------------------------------------
# now do some plotting
par(mfrow = c(2, 1))

plot(classSize, classData, ylab = "Percentage of Matched Birthdays",
xlab = "Class Size", pch = 2, type = "b")
segments(-2, 50, 22.8, 50, lty = 2, col = "red", lwd = 2)
segments(22.8, 50, 22.8, -2, lty = 2, col = "red", lwd = 2)
text(22, 5, pos = 4, "With ~23 students, p(matching birthday) = .5!", col = "red")

plot(classSize, classData, ylab = "Percentage of Matched Birthdays",
xlab = "Class Size", pch = 2, type = "b")
segments(-2, 100, 74, 100, lty = 2, col = "blue", lwd = 2)
segments(74, 100, 74, -2, lty = 2, col = "blue", lwd = 2)
text(74, 90, pos = 4, "With 74 students,
p(matching birthday) = 1!", col = "blue")
#------------------------------------------------------------------------------