Below I outline an argument for moving towards a clearer, more objective way to trim response times. I first discuss the importance of response time trimming, and then outline various methods commonly used by researchers. I then quantify the diversity of these methods by reviewing three years of articles from two prominent cognitive psychology journals, cataloguing the usage of each method. I then suggest that a technique introduced by Van Selst and Jolicoeur (1994) might (and here I stress might) be a solution to the lack of objectivity in choosing which method to use. To aid its use, I provide R scripts for the trimming method by Van Selst and Jolicoeur.
I don’t usually intend to write posts as long as this, but the text below represents a small comment paper I have been trying, unsuccessfully, to publish. Rather than let it sit in my drawer, I thought I would share it here. Comments welcome.
Response times (RT) are an incredibly popular dependent variable in cognitive psychology, whereby researchers typically take a measure of central tendency of the distribution of RTs for a given condition (often the mean, but sometimes the median) to infer the time-course of discrete psychological processes. The challenge facing researchers is how best to deal with so-called outliers: a small proportion of RTs that lie at the extremes of the RT distribution and are thought to arise from processes not under investigation. These outliers can occur at the slow end of the distribution (e.g. due to distraction or lack of alertness) or at the fast end (e.g. a rapid anticipatory response). As these outliers can influence the estimate of central tendency, and hence contaminate the estimate of the psychological process under investigation, researchers typically remove outliers (“trimming”) before conducting inferential analysis.
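To make the influence of a single outlier concrete, here is a minimal sketch in Python (not the R scripts mentioned in this post). The RT values are invented purely for illustration; the point is only that one slow response moves the mean far more than the median.

```python
from statistics import mean, median

# Invented RTs (ms) for one participant in one condition.
rts = [450, 470, 480, 500, 510, 520, 530, 550]

# The same data plus one slow response (e.g. a lapse of attention).
rts_with_outlier = rts + [3000]

# Compare how each measure of central tendency reacts to the extra trial.
print("mean without / with outlier:  ", mean(rts), mean(rts_with_outlier))
print("median without / with outlier:", median(rts), median(rts_with_outlier))
```

With these values the mean rises from 501.25 ms to roughly 779 ms, whereas the median only moves from 505 ms to 510 ms.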
But what method should be used to identify outliers? This question turns out to be very challenging to answer (with no necessarily correct answer); as such, there exists a vast and diverse range of methods typically employed. Some statisticians in fact recommend not trimming RT data at all (see e.g. Ulrich & Miller, 1994). Alternatives include taking the median (which is less affected by extreme scores than the arithmetic mean), fitting a model to the entire RT distribution (Heathcote, Popiel, & Mewhort, 1991), analysing cumulative distribution frequencies (Ratcliff, 1979; see Houghton & Grange, 2011), or applying one of a class of process models of response time (e.g. Wagenmakers, 2009).
However, if RT trimming is to be used for calculation of mean RT, it is desirable that the method employed is as objective as possible. The inconsistency of possible methods, at best, leaves researchers in a quandary as to how best to process their data; at worst, it increases researcher degrees of freedom (Simmons, Nelson, & Simonsohn, 2011): the increased flexibility in choosing which RT trimming method to use might increase false-positive rates.
The purpose of this post is to highlight the diversity in methods so that researchers are cognizant of the issue; I also propose that researchers consider establishing a standardised method of response time trimming using objective criteria. My issue is not with the trimming methods per se, but rather the potential for a lack of objectivity in selecting which method to use. In this day of increasing concern about replicability in psychological science, it is imperative to start the discussion regarding a uniform method of RT trimming. One candidate method was introduced by Van Selst and Jolicoeur (1994), but it is relatively complicated to implement. To facilitate its use, researchers have provided routines in proprietary software (e.g. SPSS; Thompson, 2006); in addition, I provide routines in the statistical package R (R Development Core Team, 2012) that implement this method.
Quantifying the Diversity
In an attempt to informally quantify the diversity of trimming methods employed, Table 1 catalogues the frequency of a number of diverse trimming methods reported in the 2010–2012 volumes of Journal of Experimental Psychology: Learning, Memory, & Cognition, and Quarterly Journal of Experimental Psychology. What strikes me is the sheer number of trimming options researchers have to choose from.
No trimming is where no clear report of a trimming method could be found in the article. This, of course, does not necessarily mean that trimming was not employed, so it is likely an over-estimate of the true number of studies that did not employ trimming. Absolute cut-off involves identifying an absolute upper and lower limit on RTs to include in the final analysis (e.g. “RTs faster than 200ms and slower than 2,000ms were excluded from data analysis”). Standard deviation (SD) trimming comes in different guises: Global SD trim removes any RTs that fall outside of a certain number of SDs from the global mean (i.e. across all participants and conditions; e.g. “RTs slower than 2.5 SD of the mean were excluded”); per cell SD trimming removes RTs outside of a certain number of SDs from the mean of each experimental cell (“RTs slower than 2.5 SD of the mean of each experimental condition were excluded”); per participant trims RTs outside of a certain number of SDs from the mean of each participant’s overall RT (“RTs slower than 2.5 SD of the mean of each participant were excluded”); per cell, per participant is arguably fairer, as the cut-off is calculated separately for each participant in each condition, and hence trimming is applied to every participant in every experimental condition (e.g. “RTs slower than 2.5 SD of the mean of each participant for each condition were excluded”).
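The differences between these variants come down to which trials are pooled before the mean and SD are calculated. The following is a rough Python sketch of that idea, not the procedure used in any of the reviewed articles. The function names, the dictionary layout of a trial (`participant`, `condition`, `rt`), and the default limits are all assumptions made for illustration, and the SD trim here is two-sided.

```python
from collections import defaultdict
from statistics import mean, stdev

def absolute_trim(rts, lower=200, upper=2000):
    """Keep RTs (ms) inside fixed limits chosen by the researcher."""
    return [rt for rt in rts if lower <= rt <= upper]

def sd_trim(trials, by=(), n_sd=2.5):
    """Keep trials whose RT lies within n_sd SDs of the mean of their group.

    `by` names the fields that define a group:
      ()                            -> global SD trim
      ("condition",)                -> per cell
      ("participant",)              -> per participant
      ("participant", "condition")  -> per cell, per participant
    """
    groups = defaultdict(list)
    for trial in trials:
        groups[tuple(trial[field] for field in by)].append(trial)

    kept = []
    for group in groups.values():
        rts = [trial["rt"] for trial in group]
        if len(rts) < 2:  # an SD cannot be computed from a single trial
            kept.extend(group)
            continue
        centre, spread = mean(rts), stdev(rts)
        kept.extend(t for t in group if abs(t["rt"] - centre) <= n_sd * spread)
    return kept

# Invented data: one participant, one condition, one very slow trial.
trials = [{"participant": "p1", "condition": "a", "rt": rt}
          for rt in [500] * 20 + [3000]]
kept = sd_trim(trials, by=("participant", "condition"))
print(len(trials), "trials before trimming,", len(kept), "after")
```

Changing only the `by` argument switches between the four SD variants, which is one way of seeing how small the procedural difference is relative to its potential effect on the retained data.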
Lack of Objectivity?
The main issue with all of the above trimming methods is their potential lack of objectivity; for example, when using an absolute cut-off, what criteria should one use for deciding on the upper limit? In the articles reviewed, the upper limit ranged from 800 milliseconds (ms) to 10,000ms. Obviously, the choice will be influenced by the difficulty of the task, but even with a relatively simple task, how does one choose whether to use 2,000ms or 2,500ms as the upper limit? The lower limit is potentially simpler, as it defines the value below which responses were likely anticipatory (i.e. unrealistically fast); but even in this simpler case, there was a wide range of limits used, ranging from 50ms to 400ms. As such, a popular alternative to the absolute cut-off is to allow the data itself to identify outliers, by removing RTs more than a certain number of standard deviations (SDs) above the mean. However, this process too might suffer from a lack of objectivity, as how does one decide on the SD value to use (2.5 SDs or 3 SDs?). In the articles reviewed, the SD chosen for the trimming ranged from 2 to 4.
A Potential Solution?
A strong candidate for an objective response time trimming method was introduced by Van Selst and Jolicoeur (1994). They noted that the outcome of many trimming methods is influenced by the sample size (i.e. number of trials) being considered, thus potentially producing bias. For example, even if RTs are drawn from identical positively-skewed distributions, a “per cell per participant” SD procedure would result in a higher mean estimate for a small sample size “condition” than a large sample size condition. This bias was shown to be removed when a “moving criterion” (MC) was used; this is where the SD used for trimming is dynamically adapted to the sample size being considered. This meets the criteria for objectivity in a trimming method as the SD used for calculating cut-off values is not determined by the researcher, but by the sample size under investigation. Thus, this method is an excellent candidate for a standardised trimming procedure.
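The sample-size bias can be illustrated with a small simulation. The sketch below is in Python and is not a reproduction of Van Selst and Jolicoeur's simulations. It assumes an ex-Gaussian RT distribution (a normal plus an exponential component) with arbitrary parameters, a fixed two-sided 2.5 SD criterion, and 3,000 simulated samples per sample size; all of these are illustrative choices of mine.

```python
import random
from statistics import fmean, stdev

def sd_trim(rts, n_sd=2.5):
    """Remove RTs more than n_sd SDs from the mean of this sample (one pass)."""
    centre, spread = fmean(rts), stdev(rts)
    return [rt for rt in rts if abs(rt - centre) <= n_sd * spread]

def mean_of_trimmed_means(n_trials, n_samples=3000, seed=1):
    """Average trimmed mean over many simulated samples of size n_trials."""
    rng = random.Random(seed)
    trimmed_means = []
    for _ in range(n_samples):
        # Positively skewed RTs: normal(400, 40) plus exponential with mean 200 ms.
        sample = [rng.gauss(400, 40) + rng.expovariate(1 / 200)
                  for _ in range(n_trials)]
        trimmed_means.append(fmean(sd_trim(sample)))
    return fmean(trimmed_means)

# Both "conditions" draw from the identical distribution; only n differs.
small = mean_of_trimmed_means(n_trials=10)
large = mean_of_trimmed_means(n_trials=100)
print(f"n = 10:  {small:.1f} ms")
print(f"n = 100: {large:.1f} ms")
```

Under these assumptions the small-sample "condition" should come out with the higher trimmed mean, because a fixed criterion removes fewer slow RTs when there are few trials.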
Van Selst and Jolicoeur (1994) introduced two MC methods that reduced the bias with sample size: The non-recursive (MC) method removes any RT that falls outside a certain number of SDs from the mean (of the whole sample) being considered, with the value of SD being determined by the sample size of the distribution, with a lower SD value being used for smaller sample sizes (see Table 4 in Van Selst & Jolicoeur). The modified recursive (MC) procedure performs trimming in cycles. It first temporarily removes the slowest RT from the distribution; then, the mean of the sample is calculated, and the cut-off value is again calculated using a certain number of SDs around the mean, with the value for SD being determined by sample size (in this procedure, required SD decreases with increased sample size; see Van Selst & Jolicoeur for justification). The temporarily removed RT is then returned to the sample, and the fastest and slowest RTs are then compared to the cut-off, and removed if they fall outside. This process is then repeated until no outliers remain, or until the sample size drops below four. The SD used for the cut-off is thus dynamically altered based on the sample size of each cycle of the procedure. Van Selst and Jolicoeur reported slight opposing trends of these two methods, suggesting a “hybrid moving criterion” method (see their footnote 2, page 648) which simply takes the average of the non-recursive (MC) and modified recursive (MC) procedures.
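The cycle just described can be sketched in a few lines of Python. This is an illustrative reading of the verbal description above, not the R scripts provided with this post and not a verified port of Van Selst and Jolicoeur's procedure. In particular, the two `placeholder_*` criterion functions are hypothetical stand-ins and do NOT contain the published Table 4 values, which should be substituted before any real use. Looking up the criterion by the current sample size (including the temporarily removed RT) is also an assumption.

```python
from statistics import mean, stdev

def nonrecursive_mc(rts, criterion_for_n):
    """One pass: the SD criterion is looked up from the sample size."""
    if len(rts) < 2:
        return sorted(rts)
    crit = criterion_for_n(len(rts))
    centre, spread = mean(rts), stdev(rts)
    return sorted(rt for rt in rts if abs(rt - centre) <= crit * spread)

def modified_recursive_mc(rts, criterion_for_n, min_n=4):
    """Repeat until no outliers remain or fewer than min_n RTs are left."""
    sample = sorted(rts)
    while len(sample) >= min_n:
        rest = sample[:-1]                   # temporarily drop the slowest RT
        centre, spread = mean(rest), stdev(rest)
        crit = criterion_for_n(len(sample))  # assumption: n of current sample
        low, high = centre - crit * spread, centre + crit * spread
        drop_fast = sample[0] < low          # slowest RT is back in `sample`
        drop_slow = sample[-1] > high
        if not (drop_fast or drop_slow):
            break
        start = 1 if drop_fast else 0
        end = len(sample) - 1 if drop_slow else len(sample)
        sample = sample[start:end]
    return sample

# Hypothetical placeholders, NOT the values from Van Selst & Jolicoeur's Table 4.
def placeholder_nonrecursive(n):
    return 2.0 if n < 20 else 2.5   # lower criterion for smaller samples

def placeholder_recursive(n):
    return 4.0 if n < 20 else 3.5   # criterion decreases as n increases

rts = [400, 420, 450, 470, 480, 500, 510, 530, 550, 5000]  # invented RTs (ms)
nonrec = nonrecursive_mc(rts, placeholder_nonrecursive)
modrec = modified_recursive_mc(rts, placeholder_recursive)
hybrid_mean = (mean(nonrec) + mean(modrec)) / 2  # average of the two procedures
print(nonrec, modrec, round(hybrid_mean, 1))
```

Passing the criterion in as a function keeps the trimming logic separate from the lookup table, so the published values can be swapped in without touching the loop.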
Although the non-recursive (MC) procedure is relatively simple to implement with equal sample sizes between conditions and participants in standard software such as Excel, the modified recursive (MC) procedure and the hybrid present some technical challenges. Specifically, the modified recursive procedure requires many cycles of removing individual RTs, calculating means, establishing a dynamic SD criterion based on the current sample size on the current cycle, replacement of RTs, and trimming; the procedure must also be aware of the stopping rule when the sample drops below four.
Of course, the Van Selst and Jolicoeur (1994) method is just one possible approach, and the field might not reach a consensus as to which method could become the standard (or might not even want a consensus). As such, the field might continue (quite understandably) to use any one of a number of methods; but at the very least, I recommend that researchers should justify explicitly why they chose the method of RT trimming they did, and potentially demonstrate whether the pattern of results changes depending on the method employed. Such disclosure will allow readers to assess to what degree the results presented might be reliant on the trimming method chosen.
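One simple way of making such a disclosure is to rerun the same comparison under several trimming rules and report them side by side. The Python sketch below does this for two invented conditions; the data, the rule names, and the choice of rules are all hypothetical and serve only to show the shape of such a check.

```python
from statistics import mean, stdev

# Invented RTs (ms) for two conditions of a single participant.
cond_a = [420, 450, 470, 480, 500, 510, 530, 560, 600, 2300]
cond_b = [460, 480, 500, 520, 540, 560, 580, 610, 650, 800]

def upper_limit(limit):
    """Absolute cut-off: drop RTs slower than `limit` ms."""
    return lambda rts: [rt for rt in rts if rt <= limit]

def sd_rule(n_sd):
    """SD trim: drop RTs more than n_sd SDs from the condition mean."""
    def trim(rts):
        centre, spread = mean(rts), stdev(rts)
        return [rt for rt in rts if abs(rt - centre) <= n_sd * spread]
    return trim

rules = {
    "no trimming": lambda rts: list(rts),
    "upper limit 2,000 ms": upper_limit(2000),
    "upper limit 2,500 ms": upper_limit(2500),
    "2.5 SDs per condition": sd_rule(2.5),
    "3 SDs per condition": sd_rule(3.0),
}

# Difference between condition means (B minus A) under each rule.
effects = {name: mean(trim(cond_b)) - mean(trim(cond_a))
           for name, trim in rules.items()}
for name, effect in effects.items():
    print(f"{name:24s} B - A = {effect:7.1f} ms")
```

With these invented values the sign of the difference depends on the rule: it is negative under no trimming, the 2,500ms limit, and the 3 SD rule, and positive under the 2,000ms limit and the 2.5 SD rule. Real data will rarely be this fragile, but a table of this kind lets readers judge that for themselves.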
R Script for Implementing Van Selst & Jolicoeur (1994)
To facilitate implementation of this method for researchers, I provide routines in the statistical package R; this set of scripts is capable of executing all three of the methods recommended by Van Selst and Jolicoeur (1994), and also includes a “quick start guide” for users unfamiliar with R. The scripts can be downloaded from GitHub.
References
Heathcote, A., Popiel, S.J., & Mewhort, D.J.K. (1991). Analysis of response time distributions: An example using the Stroop task. Psychological Bulletin, 109, 340-347.
Houghton, G. & Grange, J.A. (2011). CDF-XL: Computing cumulative distribution frequencies of reaction time data in Excel. Behavior Research Methods, 43, 1023-1032.
Ratcliff, R. (1979). Group reaction time distributions and an analysis of distribution statistics. Psychological Bulletin, 86, 446-461.
R Development Core Team. (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved from http://www.R-project.org/.
Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.
Thompson, G.L. (2006). An SPSS implementation of the non-recursive outlier detection procedure with shifting z-score criterion (Van Selst & Jolicoeur, 1994). Behavior Research Methods, 38, 344-352.
Ulrich, R. & Miller, J. (1994). Effects of truncation on reaction time analysis. Journal of Experimental Psychology: General, 123, 34-80.
Van Selst, M. & Jolicoeur, P. (1994). A solution to the effect of sample size on outlier elimination. Quarterly Journal of Experimental Psychology, 47 (A), 631-650.
Wagenmakers, E.-J. (2009). Methodological and empirical developments for the Ratcliff diffusion model of response times and accuracy. European Journal of Cognitive Psychology, 21, 641-671.