I sometimes completely sympathise with students when they say they hate statistics. I really do. A colleague emailed me the other day about a student who was having difficulty getting their hand calculation for a Friedman test to match the value produced by SPSS (shout-out to this eagle-eyed student!!). I smiled to myself, thinking “silly SPSS, giving the wrong results again!”. Then, in a flurry of smugness, I fired up R, entered the data my colleague had provided, and found to my utter surprise that R, too, was producing the “incorrect” result. I checked the hand calculations in Excel and they were all fine. So how come R and SPSS were both giving a result that differed from the hand calculation (and was apparently incorrect)?
Here are the data (including the ranks):
The equation used in the hand-calculations was
$$\chi^2_F = \frac{12}{nk(k+1)} \sum_{i=1}^{k} R_i^2 - 3n(k+1)$$
where n is the number of observations per condition (12), k is the number of conditions (3), and R_i is the sum of the ranks in the ith column (condition). It’s a gnarly-looking equation, and students gulp when it’s presented.
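As a rough sketch, the hand calculation can be reproduced in R with the standard no-ties form of the Friedman statistic, using n, k and the rank sums R_i as defined above. The rank matrix is copied from the R code at the end of this post; the variable names `R` and `chi2_no_ties` are just labels chosen for this example.

```r
# Ranks for 12 participants (rows) in 3 conditions (columns),
# copied from the R code at the end of this post.
rankData <- matrix(c(2.5, 2.5, 1,    3, 1.5, 1.5,  1.5, 3, 1.5,
                     2, 3, 1,        1.5, 3, 1.5,  3, 1.5, 1.5,
                     2, 2, 2,        3, 1.5, 1.5,  2.5, 1, 2.5,
                     3, 1.5, 1.5,    2, 3, 1,      2.5, 2.5, 1),
                   nrow = 12, byrow = TRUE,
                   dimnames = list(1:12, c("None", "Classical", "Dance")))

n <- nrow(rankData)     # observations per condition (12)
k <- ncol(rankData)     # number of conditions (3)
R <- colSums(rankData)  # rank sum for each condition

# Textbook formula that assumes no tied ranks within a participant
chi2_no_ties <- 12 / (n * k * (k + 1)) * sum(R^2) - 3 * n * (k + 1)
chi2_no_ties  # about 5.54
```

This is the value that matches the hand calculation, not the value that the software reports.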
However, it turns out this equation is only correct if there are no ties in your data (that is, no participant has the same score, and hence the same rank, in two or more conditions). My colleague informed me this was not mentioned in the book he had access to, and I have since checked three books that I have at home (I haven’t checked the ones in my office yet), and all of them give the above formula and make no mention of a correction for ties.
The interweb didn’t help much, either, in trying to work out what was going wrong, but I stumbled across a book chapter on Google books (link is here) which gives a corrected formula. To help others, it’s repeated below:
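One widely used tie correction, which may be written differently from the formula in that chapter, divides the uncorrected statistic by 1 - sum(t^3 - t) / (n k (k^2 - 1)), where t is the size of each group of tied ranks within a participant. The sketch below applies it to the rank matrix from the end of this post; the names `tie_term` and `correction` are illustrative only.

```r
# Ranks for 12 participants (rows) in 3 conditions (columns),
# copied from the R code at the end of this post.
rankData <- matrix(c(2.5, 2.5, 1,    3, 1.5, 1.5,  1.5, 3, 1.5,
                     2, 3, 1,        1.5, 3, 1.5,  3, 1.5, 1.5,
                     2, 2, 2,        3, 1.5, 1.5,  2.5, 1, 2.5,
                     3, 1.5, 1.5,    2, 3, 1,      2.5, 2.5, 1),
                   nrow = 12, byrow = TRUE)

n <- nrow(rankData)
k <- ncol(rankData)
R <- colSums(rankData)

# Uncorrected (no-ties) statistic
chi2_no_ties <- 12 / (n * k * (k + 1)) * sum(R^2) - 3 * n * (k + 1)

# For each participant, count how many conditions share each rank value,
# then sum t^3 - t over those groups (untied ranks have t = 1 and add 0).
tie_term <- sum(apply(rankData, 1, function(r) {
  sizes <- as.vector(table(r))
  sum(sizes^3 - sizes)
}))

correction <- 1 - tie_term / (n * k * (k^2 - 1))
chi2_tie_corrected <- chi2_no_ties / correction
chi2_tie_corrected  # about 7.6
```

With no ties at all, `tie_term` is 0, the correction factor is 1, and the two statistics coincide.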
Wikipedia to the rescue (!)
It turns out that Wikipedia has an excellent page for Friedman’s test which provides yet another (set of) formula(s). I’m not quite sure what original source Wikipedia got this equation from, but it turns out that this is a general formula that works for tied data and non-tied data alike. So, it makes sense that R might be using this formula (it would be more economical than having to continually test whether it should use the first or the second equation reported above).
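For reference, here is that set of formulas written out, transcribed from the R code at the end of this post, not from the Wikipedia page itself. The notation is an assumption made for this sketch: r_ij is the rank of participant i in condition j, n is the number of participants and k the number of conditions.

```latex
\begin{aligned}
\bar{r}_{\cdot j} &= \frac{1}{n}\sum_{i=1}^{n} r_{ij}, \qquad
\bar{r} = \frac{1}{nk}\sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij}, \\
SS_t &= n\sum_{j=1}^{k}\left(\bar{r}_{\cdot j}-\bar{r}\right)^2, \\
SS_e &= \frac{1}{n(k-1)}\sum_{i=1}^{n}\sum_{j=1}^{k}\left(r_{ij}-\bar{r}\right)^2, \\
Q &= \frac{SS_t}{SS_e}.
\end{aligned}
```

When there are no ties, every row is a permutation of 1, ..., k, so SS_e equals k(k+1)/12 and Q reduces to the first equation. Ties shrink SS_e, which is why the general formula gives a larger statistic for tied data.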
What have I learned?
- Students are very observant. Had this student not contacted my colleague (and had he not subsequently contacted me), this error would have crept into next year’s lectures, too. Genuine thanks to them for coming forward and questioning my lecture slides.
- Statistics is hard. How can we expect students to “get” all of this when I didn’t even know there was a mistake here?
- Most textbooks discussing this test don’t use tied data, and so use the first equation above. This does not work with tied data. You can use the second equation, but it is best to use the generalised equation on Wikipedia.
- Wikipedia is (sometimes, at least!) accurate!
- R is still ace.
For those interested, here is the R code for performing the Friedman equations from Wikipedia. This is just to show what it’s doing; you can just see the R help pages for ?friedman.test for a shortcut to computing the test.
    # Ranks for 12 participants (rows) in 3 conditions (columns)
    rankData <- matrix(c(2.5, 2.5, 1,
                         3, 1.5, 1.5,
                         1.5, 3, 1.5,
                         2, 3, 1,
                         1.5, 3, 1.5,
                         3, 1.5, 1.5,
                         2, 2, 2,
                         3, 1.5, 1.5,
                         2.5, 1, 2.5,
                         3, 1.5, 1.5,
                         2, 3, 1,
                         2.5, 2.5, 1),
                       nrow = 12, byrow = TRUE,
                       dimnames = list(1:12, c("None", "Classical", "Dance")))

    # Mean rank for each condition
    rbar_none <- (1/12) * sum(rankData[, 1])
    rbar_classical <- (1/12) * sum(rankData[, 2])
    rbar_dance <- (1/12) * sum(rankData[, 3])

    # Grand mean rank
    rbar <- (1/(3*12)) * sum(rankData)

    # Treatment and error sums of squares, then the test statistic
    ssT <- 12 * sum(((rbar_none - rbar)^2), ((rbar_classical - rbar)^2), ((rbar_dance - rbar)^2))
    ssE <- (1/(12*2)) * sum(((rankData - rbar)^2))
    Q <- ssT/ssE
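As a cross-check on Q, the same matrix can be passed to `friedman.test()`, which treats rows as blocks (participants) and columns as groups (conditions). One assumption in this sketch: the raw scores are not reproduced in this post, so the rank matrix is passed instead. That should make no difference, because re-ranking ranks within each row leaves them unchanged.

```r
# Same rank matrix as in the code above
rankData <- matrix(c(2.5, 2.5, 1,    3, 1.5, 1.5,  1.5, 3, 1.5,
                     2, 3, 1,        1.5, 3, 1.5,  3, 1.5, 1.5,
                     2, 2, 2,        3, 1.5, 1.5,  2.5, 1, 2.5,
                     3, 1.5, 1.5,    2, 3, 1,      2.5, 2.5, 1),
                   nrow = 12, byrow = TRUE,
                   dimnames = list(1:12, c("None", "Classical", "Dance")))

# Built-in test from the stats package
result <- friedman.test(rankData)

result$statistic  # Friedman chi-squared, about 7.6
result$parameter  # degrees of freedom, k - 1 = 2
result$p.value    # p-value from the chi-squared approximation
```

The statistic should agree with Q from the step-by-step code, not with the uncorrected hand calculation.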