On the diversity of response time trimming methods

Jim Grange | R, Response times | June 3, 2014 | 8 Minutes

Below I outline an argument for moving towards a clearer, more objective, way to trim response times. I first discuss the importance of response time trimming, and then outline various methods commonly used by researchers. I then quantify the diversity of these methods by reviewing 3 years of articles from two prominent cognitive psychology journals, and catalogue usage of each method. I then suggest that a technique introduced by Van Selst and Jolicoeur (1994) might—and here I stress might—be a solution to the lack of objectivity in choosing which method to use. To aid its use, I provide R scripts for the trimming method by Van Selst and Jolicoeur.

I don’t usually intend to write posts as long as this, but the text below represents a small comment paper I have been trying (unsuccessfully) to publish. Rather than let it sit in my drawer, I thought I would share it here. Comments welcomed.

Overview

Response times (RT) are an incredibly popular dependent variable in cognitive psychology, whereby researchers typically take a measure of central tendency of the distribution of total RTs for a given condition (often the mean, but sometimes the median) to infer the time-course of discrete psychological processes. The challenge facing researchers is how to best deal with so-called outliers: A small proportion of RTs that lie at the extremes of the RT distribution and are thought to arise from processes not under investigation. These outliers can occur at the slow end of the distribution (e.g. due to distraction, lack of alertness etc.) or at the faster end (e.g. a rapid anticipatory response). As these outliers can influence the estimate of central tendency, and hence contaminate the estimate of the psychological process under investigation, researchers typically remove outliers (“trimming”) before conducting inferential analysis.

But what method should be used to identify outliers? This question turns out to be very challenging to answer (with no necessarily correct answer); as such, there exists a vast and diverse range of methods typically employed. Some statisticians in fact recommend not trimming RT data at all (see e.g. Ulrich & Miller, 1994). Alternatives include taking the median (which is less affected by extreme scores than the arithmetic mean), fitting a model to the entire RT distribution (Heathcote, Popiel, & Mewhort, 1991), analysing cumulative distribution frequencies (Ratcliff, 1979; see Houghton & Grange, 2011), or applying one of a class of process models of response time (e.g. Wagenmakers, 2009).

However, if RT trimming is to be used for calculation of mean RT, it is desirable that the method employed is as objective as possible. The inconsistency of possible methods at best leaves researchers in a quandary about how best to process their data; at worst, it increases researcher degrees of freedom (Simmons, Nelson, & Simonsohn, 2011): The increased flexibility in choosing which RT trimming method to use might increase false-positive rates.

The purpose of this post is to highlight the diversity in methods so that researchers are cognizant of the issue; I also propose that researchers consider establishing a standardised method of response time trimming using objective criteria. My issue is not with the trimming methods per se, but rather the potential for a lack of objectivity in selecting which method to use. At a time of increasing concern about replicability in psychological science, it is imperative to start the discussion regarding a uniform method of RT trimming. One candidate method was introduced by Van Selst and Jolicoeur (1994), but it is relatively complicated to implement. To facilitate its use, researchers have provided routines in proprietary software (e.g. SPSS; Thompson, 2006); in addition, I provide routines in the statistical package R (R Development Core Team, 2012) that implement this method.

Quantifying the Diversity

In an attempt to informally quantify the diversity of trimming methods employed, Table 1 catalogues the frequency of a number of diverse trimming methods reported in the 2010–2012 volumes of Journal of Experimental Psychology: Learning, Memory, & Cognition, and Quarterly Journal of Experimental Psychology. What strikes me is the sheer number of trimming options researchers have to choose from.

“No trimming” is recorded where no clear report of a trimming method could be found in the article. This, of course, does not necessarily mean that trimming was not employed, so the count is likely an over-estimate of the true number of studies that did not employ trimming. Absolute cut-off involves identifying an absolute upper- and lower-limit on RTs to include in the final analysis (e.g. “RTs faster than 200ms and slower than 2,000ms were excluded from data analysis”). Standard deviation (SD) trimming comes in different guises: Global SD trimming removes any RTs that fall outside a certain number of SDs from the global mean (i.e. across all participants and conditions; e.g. “RTs slower than 2.5 SD of the mean were excluded”); per cell SD trimming removes RTs outside a certain number of SDs from the mean of each experimental cell, collapsed across participants (“RTs slower than 2.5 SD of the mean of each experimental condition were excluded”); per participant SD trimming removes RTs outside a certain number of SDs from each participant’s overall mean RT (“RTs slower than 2.5SD of the mean of each participant were excluded”); per cell, per participant SD trimming is arguably fairer, as it applies the criterion separately to each participant in each condition, and hence will certainly trim from all experimental conditions (e.g. “RTs slower than 2.5 SD of the mean of each participant for each condition were excluded”).
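To make the difference between the four SD variants concrete, here is a minimal sketch in Python (not the R scripts this post provides). The function name `sd_trim`, the field names `participant`, `condition` and `rt`, and the data are all made up for illustration, and the sketch assumes a two-sided trim using the sample SD.

```python
from collections import defaultdict
from statistics import mean, stdev


def sd_trim(trials, n_sd=2.5, by=()):
    """Keep trials whose RT lies within n_sd sample SDs of their group's mean.

    by=()                            -> global SD trim
    by=("condition",)                -> per cell
    by=("participant",)              -> per participant
    by=("participant", "condition")  -> per cell, per participant
    """
    # Collect the RTs belonging to each group
    groups = defaultdict(list)
    for t in trials:
        groups[tuple(t[k] for k in by)].append(t["rt"])

    # Lower and upper cut-off for each group
    limits = {}
    for key, rts in groups.items():
        m = mean(rts)
        s = stdev(rts) if len(rts) > 1 else 0.0
        limits[key] = (m - n_sd * s, m + n_sd * s)

    # Keep only trials inside their own group's cut-offs
    kept = []
    for t in trials:
        low, high = limits[tuple(t[k] for k in by)]
        if low <= t["rt"] <= high:
            kept.append(t)
    return kept


# Made-up data: a fast participant with one very slow trial, and a slow participant
trials = []
for pid, rts in (("p1", [500] * 9 + [2000]),
                 ("p2", [1400, 1450, 1500, 1550, 1600] * 2)):
    for i, rt in enumerate(rts):
        trials.append({"participant": pid, "condition": "AB"[i % 2], "rt": rt})

n_global = len(sd_trim(trials, 2.5))
n_per_participant = len(sd_trim(trials, 2.5, by=("participant",)))
n_per_cell_per_participant = len(sd_trim(trials, 2.5, by=("participant", "condition")))
```

With these made-up data, the global trim keeps the 2,000ms trial, the per participant trim removes it, and the per cell, per participant trim keeps it again: with only five trials in a cell, no RT can lie more than about 1.79 SDs from the cell mean, so a 2.5 SD criterion can never be exceeded.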

Lack of Objectivity?

The main issue with all of the above trimming methods is their potential lack of objectivity; for example, when using an absolute cut-off, what criteria should one use for deciding on the upper limit? In the articles reviewed, the upper limit ranged from 800 milliseconds (ms) to 10,000ms. Obviously, the choice will be influenced by the difficulty of the task, but even with a relatively simple task, how does one choose whether to use 2,000ms or 2,500ms as the upper limit? The lower limit is potentially simpler, as it defines the value below which responses were likely anticipatory (i.e. unrealistically fast); but even in this simpler case, there was a wide range of limits used, ranging from 50ms to 400ms. As such, a popular alternative to the absolute cut-off is to allow the data itself to identify outliers, by removing RTs that lie more than a certain number of standard deviations (SDs) above the mean. However, this process too might suffer from a lack of objectivity: how does one decide on the SD value to use (2.5 SDs or 3 SDs)? In the articles reviewed, the SD chosen for the trimming ranged from 2 to 4.
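As a small illustration of why the choice of upper limit matters, the Python sketch below applies an absolute cut-off with the two upper limits mentioned above. The function name `absolute_trim` and the RTs are made up for illustration.

```python
from statistics import mean


def absolute_trim(rts, lower, upper):
    """Keep RTs (ms) that are no faster than `lower` and no slower than `upper`."""
    return [rt for rt in rts if lower <= rt <= upper]


# Made-up RTs (ms) from a single hypothetical condition
rts = [150, 420, 480, 510, 530, 600, 750, 1900, 2300, 2600]

# Same data, same lower limit, two equally defensible upper limits
for upper in (2000, 2500):
    kept = absolute_trim(rts, lower=200, upper=upper)
    print(upper, len(kept), round(mean(kept), 2))
```

In this toy example, moving the upper limit from 2,000ms to 2,500ms retains one extra trial and shifts the mean from about 741ms to about 936ms.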

A Potential Solution?

A strong candidate for an objective response time trimming method was introduced by Van Selst and Jolicoeur (1994). They noted that the outcome of many trimming methods is influenced by the sample size (i.e. number of trials) being considered, thus potentially producing bias. For example, even if RTs are drawn from identical positively-skewed distributions, a “per cell per participant” SD procedure would result in a higher mean estimate for a small sample size “condition” than for a large sample size condition. This bias was shown to be removed when a “moving criterion” (MC) was used; this is where the SD used for trimming is dynamically adapted to the sample size being considered. This meets the criteria for objectivity in a trimming method as the SD used for calculating cut-off values is not determined by the researcher, but by the sample size under investigation. Thus, this method is an excellent candidate for a standardised trimming procedure.

Van Selst and Jolicoeur (1994) introduced two MC methods that reduced the bias with sample size. The non-recursive (MC) method removes any RT that falls outside a certain number of SDs from the mean of the whole sample being considered, with the number of SDs determined by the sample size of the distribution: a lower criterion is used for smaller sample sizes (see Table 4 in Van Selst & Jolicoeur). The modified recursive (MC) procedure performs trimming in cycles. It first temporarily removes the slowest RT from the distribution; the mean of the remaining sample is then calculated, and cut-off values are set at a certain number of SDs either side of this mean, with the number of SDs again determined by sample size (in this procedure, the required criterion decreases with increased sample size; see Van Selst & Jolicoeur for justification). The temporarily removed RT is then returned to the sample, and the fastest and slowest RTs are compared to the cut-offs and removed if they fall outside them. This process is repeated until no outliers remain, or until the sample size drops below four. The SD criterion is thus dynamically altered based on the sample size on each cycle of the procedure. Van Selst and Jolicoeur reported slight opposing trends of these two methods, suggesting a “hybrid moving criterion” method (see their footnote 2, page 648), which simply takes the average of the means produced by the non-recursive (MC) and modified recursive (MC) procedures.
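The two procedures and the hybrid, as described above, can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the published algorithm or the R scripts: the function names are hypothetical, the criterion tables hold placeholder values (NOT those in Van Selst and Jolicoeur’s Table 4, which should be substituted), criteria for untabled sample sizes are assumed to be linearly interpolated, and the modified recursive criterion is assumed to be looked up using the current sample size including the temporarily removed RT.

```python
from statistics import mean, stdev

# PLACEHOLDER criterion tables: (sample size, number of SDs), sorted by sample size.
# Illustrative values only; substitute Table 4 of Van Selst & Jolicoeur (1994).
NON_RECURSIVE = [(4, 1.5), (10, 2.2), (100, 2.5)]        # lower for small samples
MODIFIED_RECURSIVE = [(4, 8.0), (10, 4.0), (100, 3.5)]   # higher for small samples


def criterion(n, table):
    """Linearly interpolate the SD criterion for sample size n (clamped at the ends)."""
    if n <= table[0][0]:
        return table[0][1]
    if n >= table[-1][0]:
        return table[-1][1]
    for (n0, c0), (n1, c1) in zip(table, table[1:]):
        if n0 <= n <= n1:
            return c0 + (c1 - c0) * (n - n0) / (n1 - n0)


def non_recursive(rts, table=NON_RECURSIVE):
    """One pass: drop RTs beyond the size-dependent criterion from the sample mean."""
    c = criterion(len(rts), table)
    m, s = mean(rts), stdev(rts)
    return [rt for rt in rts if abs(rt - m) <= c * s]


def modified_recursive(rts, table=MODIFIED_RECURSIVE):
    """Cycle until no outliers remain or the sample drops below four."""
    sample = sorted(rts)
    while len(sample) >= 4:
        c = criterion(len(sample), table)  # assumption: current sample size
        rest = sample[:-1]                 # temporarily set aside the slowest RT
        m, s = mean(rest), stdev(rest)
        low, high = m - c * s, m + c * s
        trimmed = list(sample)             # slowest RT is back in the sample
        if trimmed[-1] > high:
            trimmed.pop()                  # remove slowest RT
        if trimmed[0] < low:
            trimmed.pop(0)                 # remove fastest RT
        if len(trimmed) == len(sample):
            break                          # nothing removed on this cycle
        sample = trimmed
    return sample


def hybrid_mean(rts):
    """Average of the trimmed means from the two procedures."""
    return (mean(non_recursive(rts)) + mean(modified_recursive(rts))) / 2


# Made-up RTs (ms) for one participant in one condition
rts = [420, 450, 470, 480, 500, 510, 530, 550, 580, 3000]
trimmed_mean = hybrid_mean(rts)
```

With these made-up RTs and placeholder criteria, both procedures remove only the 3,000ms trial. In practice the function would be applied separately to each participant in each condition.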

Although the non-recursive (MC) procedure is relatively simple to implement with equal sample sizes between conditions and participants in standard software such as Excel, the modified recursive (MC) procedure and the hybrid present some technical challenges. Specifically, the modified recursive procedure requires many cycles of removing individual RTs, calculating means, establishing a dynamic SD criterion based on the current sample size on the current cycle, replacement of RTs, and trimming; the procedure must also be aware of the stopping rule when the sample drops below four.

Of course, the Van Selst and Jolicoeur (1994) method is just one possible approach, and the field might not reach a consensus as to which method could become the standard (or might not even want a consensus). As such, the field might continue (quite understandably) to use any one of a number of methods; but at the very least, I recommend that researchers should justify explicitly why they chose the method of RT trimming they did, and potentially demonstrate whether the pattern of results changes depending on the method employed. Such disclosure will allow readers to assess to what degree the results presented might be reliant on the trimming method chosen.
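One way to make such a disclosure is to report the effect of interest under several trimming criteria side by side. The Python sketch below does this for an upper SD trim at a few criteria; the function names, the data, and the choice of criteria are all made up for illustration.

```python
from statistics import mean, stdev


def trim_upper_sd(rts, n_sd):
    """Drop RTs more than n_sd sample SDs above the mean of rts."""
    m, s = mean(rts), stdev(rts)
    return [rt for rt in rts if rt <= m + n_sd * s]


def effect_by_criterion(cond_a, cond_b, criteria):
    """Difference in mean RT (B minus A), untrimmed and under each SD criterion."""
    results = {"none": mean(cond_b) - mean(cond_a)}
    for c in criteria:
        results[c] = mean(trim_upper_sd(cond_b, c)) - mean(trim_upper_sd(cond_a, c))
    return results


# Made-up RTs (ms) for two conditions of one hypothetical participant
cond_a = [480, 500, 510, 520, 530, 540, 550, 560, 580, 600]
cond_b = [500, 520, 530, 540, 550, 560, 570, 580, 600, 1500]

report = effect_by_criterion(cond_a, cond_b, criteria=[2.0, 2.5, 3.0])
for label, effect in report.items():
    print(label, effect)
```

In this toy example the difference between conditions is 108ms with no trimming or a 3 SD criterion, but only 13ms with a 2 or 2.5 SD criterion, because a single slow trial is retained or removed depending on the criterion chosen.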

R Script for Implementing Van Selst & Jolicoeur (1994)

To facilitate implementation of this method for researchers, I provide routines in the statistical package R; this set of scripts is capable of executing all three of the methods recommended by Van Selst and Jolicoeur (1994), and also includes a “quick start guide” for users unfamiliar with R. The scripts can be downloaded from GitHub.

References

Heathcote, A., Popiel, S.J., & Mewhort, D.J.K. (1991). Analysis of response time distributions—An example using the Stroop task. Psychological Bulletin, 109, 340-347.

Houghton, G. & Grange, J.A. (2011). CDF-XL: Computing cumulative distribution frequencies of reaction time data in Excel. Behavior Research Methods, 43, 1023-1032.

Ratcliff, R. (1979). Group reaction time distributions and an analysis of distribution statistics. Psychological Bulletin, 86, 446-461.

R Development Core Team. (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved from http://www.R-project.org/.

Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

Thompson, G.L. (2006). An SPSS implementation of the non-recursive outlier detection procedure with shifting z-score criterion (Van Selst & Jolicoeur, 1994). Behavior Research Methods, 38, 344-352.

Ulrich, R. & Miller, J. (1994). Effects of truncation on reaction time analysis. Journal of Experimental Psychology: General, 123, 34-80.

Van Selst, M. & Jolicoeur, P. (1994). A solution to the effect of sample size on outlier elimination. Quarterly Journal of Experimental Psychology, 47 (A), 631-650.

Wagenmakers, E.-J. (2009). Methodological and empirical developments for the Ratcliff diffusion model of response times and accuracy. European Journal of Cognitive Psychology, 21, 641-671.

Published by Jim Grange

I am a Lecturer (Assistant Professor equivalent) in cognitive psychology at Keele University UK.

20 thoughts on “On the diversity of response time trimming methods”

Carla S. says:

November 5, 2014 at 16:53

Hi Jim,
I just wanted to say thanks for the post as it did make some things clear to me that were not so clear when I read Van Selst and Jolicoeur. I’m trying to figure out how to deal with outliers for my doctoral dissertation. One comment: I think your acronyms may be wrong. You use “MC” throughout when I think you want to use “NC” and “MC”, or maybe even, “MR” and “NR” if you want to be consistent with the acronyms used by Van Selst and Jolicoeur.
All the best,

1. Jim Grange says:
  
  December 12, 2014 at 08:21
  
  Hi Carla,
  
  Thanks very much for your message – sorry it’s taken so long to reply, my email didn’t notify me I had a message pending! I will check out the acronyms and change accordingly, thanks for pointing these errors out. I haven’t done much work on this lately, but I do plan a study trying to assess the effect of all trimming methods. The VS&J method is certainly objective (which is the nice feature), but it’s still not clear to me which method is “best” (but how would this be defined?). More and more, I have just been using no trimming and rather fitting models to all data (where the models allow for “noisy” outliers); I will likely continue this until I am happy I am using an objective and “correct” method.
  
  Good luck with the rest of your PhD!
  
  Jim.
  
Pingback: trimr: An R Package of Response Time Trimming Methods | Jim Grange
Sam says:

August 25, 2015 at 20:55

I’m writing a paper in which we opted not to trim RTs at all, given some of the concerns you raise. We didn’t have RTs shorter than 200ms and there were theoretical reasons to expect very short and very long RTs. Just wondering if you know of empirical papers where no trimming was undertaken (which we could cite as precedents, in addition to providing our rationale)? Thanks!

1. Jim Grange says:
  
  August 26, 2015 at 08:23
  
  Hi, thanks for your comment. In a paper where we tested theoretical predictions of a computational model, we (Grange & Juvina, 2015) chose not to trim RTs, as we were interested also in RT-distribution analysis. I also did the same in another paper (Grange et al., 2012) which focussed on RT distribution analysis and diffusion modelling. Sorry to self-cite here, but I am not aware of many other people who have done it. The reviewers didn’t mention it, except in the Grange et al. (2012) paper, where we were asked to show that “standard” trimming produced qualitatively the same results (which we included as a comment in the results section).
  
  Hope that helps!
  
  References
  
  Grange, J.A. & Juvina, I. (2015). The effect of practice on n-2 repetition costs in set switching. Acta Psychologica, 154, 14-25.
  
  Grange, J.A., Lody, A., & Bratt, S. (2012). Cost-benefit and distributional analyses of accessory stimuli. Psychological Research, 76, 626-633
  
  1. Sam says:
    
    August 26, 2015 at 12:56
    
    Thanks for your reply and suggestions! I will think about what to do. It might make sense to go ahead with the untrimmed RTs but then do the same as you did and report that trimming did not change the results. There may be more precedents for not trimming in the developmental literature that I’m not aware of.
Jim Grange says:

August 26, 2015 at 12:58

As there is so much diversity in the methods used, I think you could justify using just about any trimming criteria you wish (including none at all!).

Graduate Student says:

January 27, 2016 at 23:02

Hi Jim,

Your R script is very helpful – thank you. I have a question: the non-recursive criterion cutoff listed in the linearInterpolation.txt file for a sample size of 4 is 1.961, while the non-recursive criterion cutoff listed in Van Selst and Jolicoeur (1994) for a sample size of 4 is 1.458. I was wondering why that was? Also, was there a particular formula you used to derive your cutoff criterion?

Cheers,

Graduate Student says:

January 28, 2016 at 15:22

Hi Jim,

I added this comment yesterday, but I’m not sure if it got posted. I had a question about the cut-off criterion you used. In your linearinterpolation.txt file you have the cut-off criterion for a sample size of 4 at 1.961, while in Van Selst and Jolicoeur (1994) they have the cut-off criterion for a sample size of 4 at 1.458. I was wondering why that was?

1. Jim Grange says:
  
  January 28, 2016 at 22:29
  
  Hi there,
  
  Sorry for delay in responding. WordPress doesn’t automatically post comments, so I have to manually approve, and I didn’t realise I had a comment pending!
  
  Anyway, in response to your question, this was an error I had noticed quite some time ago. I have now produced an R package (available on CRAN here: https://cran.r-project.org/web/packages/trimr/) that does all trimming methods I mentioned in that post.
  
  You can see the code for the package here: https://github.com/JimGrange/trimr
  
  In the data folder, you will note the linearInterpolation.rda file now has correct cutoffs. For calculating the cutoffs, I used linear interpolation (as did Van Selst).
  
  Hope that helps.
  
  Cheers,
  Jim.
  
  1. Jim Grange says:
    
    January 28, 2016 at 22:32
    
    …more specifically, I used the “approx” function in R: http://astrostatistics.psu.edu/su07/R/html/stats/html/approxfun.html
2. Jim Grange says:
  
  January 30, 2016 at 17:48
  
  Hi there – did you see my response? Best wishes.
  
Tomás Palma says:

April 22, 2016 at 11:50

Hi Jim,

Great post and thanks for the R script. I have a question about Table 1 above. How did you get it? Can I find it in a paper? If not, can you recommend a paper that presents information about the methods that are more commonly used?

Thanks!
Best,
Tomás

1. Jim Grange says:
  
  April 28, 2016 at 15:30
  
  Hi Tomas – thanks for your email, and apologies for delay. I tried to publish that post as a short commentary paper, but it was never accepted anywhere. I am not aware of other papers looking at this which is why I am interested in it.
  
  Cheers,
  Jim.
  
  1. Tomás Palma says:
    
    April 28, 2016 at 16:14
    
    Hi Jim,
    
    Thanks for the reply. It is really a shame the paper was never accepted. I found it very very useful.
    
    Thanks again. And, keep posting stuff like this.
    
    Best,
    Tomás
Diana Orghian says:

June 9, 2016 at 19:43

Hi Jim,
I found this post very useful, thanks for all the effort.
I am using the R package you developed and it works nicely.
However, in the final output it only gives me the mean per participant per condition. I was wondering whether I can get the raw trimmed data (the RTs per trial)? I tried to change the code but I did not manage to get it working. Any suggestions?
Thanks a lot.
Best,
Diana

1. Jim Grange says:
  
  June 12, 2016 at 16:38
  
  Hi Diana,
  
  Thanks for your message. I assume you mean for the recursive functions? For example, the SD trimming function allows you to return the raw data. But yes, you are correct, this is currently not possible for the recursive functions. It is on my “to-do” list to update this aspect of the package, and I am sure I will get to it at some point in the coming weeks. In the meantime, if it is a simple data set you are needing this trimming urgently for, you could shoot me an email and I could knock up a quick script that will return the raw data for you in the meantime.
  
  Cheers,
  Jim.
  
  1. Diana Orghian says:
    
    June 14, 2016 at 21:06
    
    I really appreciate your help Jim.
    The script works like a charm!
    Cheers,
    Diana
tomstafford says:

March 29, 2020 at 09:01

Seems like Sam Parsons’ multiverse analysis of pre-processing choices aligns with what you are interested in: https://twitter.com/Sam_D_Parsons/status/1196143190727938050

1. Jim Grange says:
  
  June 28, 2020 at 15:10
  
  Sorry Tom – Only just seen your comment (I’ve moved my blog to http://www.jimgrange.org now so I don’t check back here very often). Thanks for flagging this tweet.
  