I Heart Statistics

Students often look at me oddly when I try to express just how much I love statistics. Colleagues look at me oddly when I try to express how much I love computer simulation. Yes, they both can be dull (I suppose; I don’t see it, but whatever…), but they can also be pretty damn cool. I’m always on the lookout for examples of how statistics have been applied in neat and interesting ways, mostly for my own geeky interest, but also to potentially use to promote the utility of statistics to students.

This morning I came across a Tweet advertising a talk tonight in London by Dr Ruth King at London’s Mathematical Society. The title of the talk is “How to count invisible people”.  I was instantly interested! The question being addressed is how do you count people who don’t wish to be counted, such as the number of illegal immigrants  in the United Kingdom? A related problem is how to estimate the size of a given population if you can’t access the whole population.

It turns out you can arrive at a pretty decent estimate of the size of a population using something called the Lincoln-Peterson estimator, which I had not heard of before. It’s pretty neat! It demonstrates nicely the utility of statistics: they allow you to infer to a population what you have only measured in a sample from that population.

Say you wish to estimate the population of a large city. What you can do is go out on one particular day, and approach a set number of people (say 5,000), and take their names on a list (let’s call this list A). Then, come back on another day soon after, and approach another set number of people (say another 5,000), and put their names on a list (list B). Then, count the number of people that were present on both lists. Using the Lincoln-Peterson estimator, you can arrive at a pretty-decent estimate of the total population using the following equation: $Estimated Population = \left(\frac{N_{list A} * N_{list B}}{N_{BothLists}}\right)$

where N refers to “number”.

Simulating the Lincoln-Peterson Estimation Accuracy

How cool is that?? But does it work? If it does work, how accurate is it? I was skeptical that if you had 5,000 people on list A and 5,000 people on list B that you could accurately estimate the population size if the TRUE (but unknown) size was significantly larger, say 5 million. So, I wanted to test this.

I utilised computer simulation to arrive at an estimate of how accurate the formula is. Computer simulation is ideal for this as you can tell the computer what the true population size of a simulated city is. Thus, we can compare the estimate from the Lincoln-Peterson equation with the true and known population size. This can’t be done with “real-life”.

In my first simulation, I set the true population size to 50,000. I “approached” 5,000 people for list A, and 5,000 people for list B. The ratio of people on each list and the true population size is very small in this example, but I wanted to test how well the formula worked under very favourable circumstances. To get an estimate of the error in the Lincoln-Peterson formula, I repeated this census-taking 10,000 times, so I could plot the outcome of each census as a boxplot, which provides estimates of variance in the equation.

The Figure below shows the result of this simulation. As can be seen, the median of the simulated censi is spot-on with the “true” population size (shown as the horizontal red line). The whiskers have a range of median +/- 10%, which is pretty good! But this simulation was favourable for the equation, as the number sampled (5,000) was pretty close to the true population size (50,000). In the next simulation, I repeated the above simulation across a wide range of true population sizes, from 25,000 up to 5 million. It’s important to note that in each simulation I still only sampled 5,000 people on each day. The result is below. As can be seen, the simulation median always captures the true population size, although the variance in this estimate increases as the true population size increases. This is to be expected. But, the formula still works amazingly well. You can pretty accurately estimate a true population of 5 million people just by sampling 5,000 people! Amazing!

I was really impressed by this, and it’s a great example of the utilisation of statistics to get at real-world problems. So, if you’re in London tonight, go to that talk. I will be sad to miss it!

Here is the R code for the above simulations:

# Lincoln–Petersen estimator
#------------------------------------------------------------------------------
### First simulation: just one known population

# what is the true number in the popoulation?
nTrue <- 50000

# how many simulated censi to take?
nSims <- 10000

# vector to store estimate of each simulated census
data <- numeric(nSims)

## conduct the simulation!
for(i in 1:nSims){

# take the first census
nFirst <-5000
a <- sample(1:nTrue, size = nFirst, replace = FALSE)
a <- sort(a)

# take the second census
nSecond <- 5000
b <- sample(1:nTrue, size = nSecond, replace = FALSE)
b <- sort(b)

# how many people are on both lists?
nBoth <- which(a %in% b)
nBoth <- length(nBoth)

# calculate the estimated total population using the
# Lincoln–Petersen estimator
total <- (nFirst * nSecond) / nBoth

# remove zeroes from nBoth (when noone was present on both days)
if(nBoth == 0){
total = 0
}

# store the data of the current simulation
data[i] <- total
}

# plot the data
boxplot(data, ylab = "Estimated Population Size", main = nTrue)
abline(h = nTrue, lwd = 2, lty = 2, col = "red")
#------------------------------------------------------------------------------

#------------------------------------------------------------------------------
# change the plot to a 3x2 grid
par(mfrow = c(3, 2))

# disable scientific notation
options(scipen=999)

# what true populations to explore?
population <- c(25000, 100000, 250000, 500000, 1000000, 5000000)

# how many simulated censi to take?
nSims <- 10000

# iterate over each true population, and run the simulations
for(currPop in population){

# vector to store outcome of each simulated census
data <- numeric(nSims)

for(i in 1:nSims){

# take the first census
nFirst <- 5000
a <- sample(1:currPop, size = nFirst, replace = FALSE)
a <- sort(a)

# take the second census
nSecond <- 5000
b <- sample(1:currPop, size = nSecond, replace = FALSE)
b <- sort(b)

# how many are on both lists?
nBoth <- which(a %in% b)
nBoth <- length(nBoth)

# calculate the estimated total population using the
# Lincoln–Petersen estimator
total <- (nFirst * nSecond) / nBoth

# remove zeroes from nBoth
if(nBoth == 0){
total = 0
}

# store the data of the current simulation
data[i] <- total
}

# plot the data
boxplot(data, ylab = "Estimated Population Size", main = currPop)
abline(h = currPop, lwd = 2, lty = 2, col = "red")
}
#------------------------------------------------------------------------------