I came across an interesting project the other day that calls for a reconsideration of the use of bar plots (#barbarplots), with the lovely tag-line **“Friends don’t let friends make bar plots!”**. The project elegantly outlines convincing reasons why bar plots can be misleading, and its team has successfully funded a campaign to “…increase awareness of the limitations that bar plots have and the need for clear and complete data visualization”.

In this post, I want to show the limitations of bar plots that these scientists have highlighted. Then, for researchers who want to continue using bar plots, I provide a solution to these limitations that can easily be cobbled together in R (with the ggplot2 package).

## The Data

Say you are a researcher who collects some data (it doesn’t matter on what) from two independent groups, and you are interested in whether there is a difference between them. Most researchers would calculate the mean and standard error of each group to describe the data. The researcher might then plot the data using a bar plot, with error bars representing the standard error. To test inferentially whether a difference exists, the researcher would usually conduct an independent samples t-test.

Let’s provide some example data for two conditions:

- condition A (n = 100): mean of 200.17, a median of 196.43, and a standard error of 6.12
- condition B (n = 100): mean of 200.11, a median of 197.87, and a standard error of 7.19

Here is the bar plot:

Pretty similar, right? The researcher sees that there is little evidence for a difference; to test this inferentially they conduct an independent samples t-test, with the outcome *t*(198) = 0.007, *p* = .995, Cohen’s *d* < 0.001. The researcher concludes there is no difference between the two groups.
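As a rough sketch, here is how those descriptive statistics and the t-test could be computed in base R. It assumes the simulation settings from the R Code section at the end of this post and a Student's t-test with equal variances (which matches the 198 degrees of freedom reported above). The helper `se()`, the variable names, and the pooled-SD definition of Cohen's *d* are illustrative choices, not necessarily the exact calculation behind the reported figures.

```r
# Assumption: same seed and simulation settings as the R Code section below
set.seed(100)
dv_A <- rnorm(100, 200, 60)
dv_B <- c(rnorm(50, 130, 10), rnorm(50, 270, 10))

# illustrative helper: standard error of the mean
se <- function(x) sd(x) / sqrt(length(x))

# descriptive statistics for each condition
summary_stats <- data.frame(
  condition = c("condition_A", "condition_B"),
  mean      = c(mean(dv_A), mean(dv_B)),
  median    = c(median(dv_A), median(dv_B)),
  se        = c(se(dv_A), se(dv_B))
)
print(summary_stats)

# independent samples t-test; var.equal = TRUE gives df = n1 + n2 - 2
t_result <- t.test(dv_A, dv_B, var.equal = TRUE)
print(t_result)

# Cohen's d using the pooled standard deviation (one common definition)
n_A <- length(dv_A)
n_B <- length(dv_B)
pooled_sd <- sqrt(((n_A - 1) * var(dv_A) + (n_B - 1) * var(dv_B)) /
                    (n_A + n_B - 2))
cohens_d <- (mean(dv_A) - mean(dv_B)) / pooled_sd
print(cohens_d)
```

Note that `t.test()` defaults to Welch's test (`var.equal = FALSE`), which would typically report non-integer degrees of freedom.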

## The Problem

The problem raised by the #barbarplots campaign is that bar plots are a poor summary of the **distribution** of data. The bar plot above suggests there is no difference between the two groups, but the two groups **are** different! How do I know they are different? I simulated the data. What the bar plot hides is the shape of the underlying distribution of each data set. Below I present a density plot (basically a smoothed histogram) of the same data as above:

Now we can see that the two groups are **clearly** different! Condition A is a normal distribution, but condition B is bi-modal. The bar plot doesn’t capture this difference.
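A density plot like this takes only a few lines with ggplot2. The sketch below assumes the same simulation settings as the R Code section at the end of this post; the object name `p_density` and the transparency value are illustrative choices.

```r
library(ggplot2)

# Assumption: same seed and simulation settings as the R Code section below
set.seed(100)
raw_data <- rbind(
  data.frame(condition = "condition_A", dv = rnorm(100, 200, 60)),
  data.frame(condition = "condition_B",
             dv = c(rnorm(50, 130, 10), rnorm(50, 270, 10)))
)

# one semi-transparent density curve per condition, overlaid on the same axes
p_density <- ggplot(raw_data, aes(x = dv, fill = condition)) +
  geom_density(alpha = 0.4) +
  labs(x = "DV", y = "Density")

# view the plot
p_density
```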

## The Solution

Density plots are a nice solution to presenting the distribution of data, but they can get really messy when there are multiple conditions (imagine the above density plot but with 4 or more overlapping conditions). Plus, researchers are used to looking at bar plots, so there is something to be said for continuing their use (especially for factorial designs). But how do we get around the problem highlighted by the #barbarplots campaign?

One solution is to plot the bar plots as usual, but to overlay the bar plot with individual data points. Doing this allows the reader to see the estimates of central tendency (i.e., to interpret the bar plot as usual), whilst at the same time allowing the reader to see the spread of data in each condition. This sounds tricky to do (and it probably is if you are still using Excel; yes, I’m talking to **you**!), but it’s simple if you’re using R.

Below is the above data plotted as a combined bar and point plot. As you can see, the difference in distribution is now immediately apparent, whilst retaining the advantages of a familiar bar plot. Everyone wins!

## R Code

Below is the R code for the combined plot. This includes some code that generates the artificial data used in this example.

```r
#------------------------------------------------------------------------------
# load required packages
library(ggplot2)
library(dplyr)

#--- Generate artificial data

# set random seed so example is reproducible
set.seed(100)

# generate condition A
condition <- rep("condition_A", 100)
dv_A <- rnorm(100, 200, 60)
condition_A <- data.frame(condition, dv = dv_A)

# generate condition B
condition <- rep("condition_B", 100)
dv_B <- c(rnorm(50, 130, 10), rnorm(50, 270, 10))
condition_B <- data.frame(condition, dv = dv_B)

# put all in one data frame
raw_data <- rbind(condition_A, condition_B)

# calculate summary statistics
data_summary <- raw_data %>%
  group_by(condition) %>%
  summarise(mean = mean(dv),
            median = median(dv),
            se = sd(dv) / sqrt(length(dv)))
#------------------------------------------------------------------------------

#------------------------------------------------------------------------------
#--- Do the "combined" bar plot
p2 <- ggplot()

# first draw the bar plot (bars only need x and y; the error limits are
# mapped in the error bar layer below)
p2 <- p2 + geom_bar(data = data_summary,
                    aes(y = mean, x = condition),
                    fill = "darkgrey", stat = "identity", width = 0.4)

# draw the error bars on the plot
# (linewidth needs ggplot2 >= 3.4.0; use size = 1 on older versions)
p2 <- p2 + geom_errorbar(data = data_summary,
                         aes(x = condition,
                             ymin = mean - se, ymax = mean + se),
                         width = 0.1, linewidth = 1)

# now draw the points on the plot
p2 <- p2 + geom_point(data = raw_data,
                      aes(y = dv, x = condition),
                      size = 3, alpha = 0.3,
                      position = position_jitter(width = 0.3, height = 0.1))

# scale and rename the axes, and make font size a bit bigger
p2 <- p2 + coord_cartesian(ylim = c(50, 400))
p2 <- p2 + scale_x_discrete(name = "Condition") +
  scale_y_continuous(name = "DV")
p2 <- p2 + theme(axis.text = element_text(size = 12),
                 axis.title = element_text(size = 14, face = "bold"))

# view the plot
p2
#------------------------------------------------------------------------------
```

Great stuff. Looks like position_jitterdodge() will also be useful here.

Ah, does that calculate the optimal jitter so the user doesn’t have to do it manually like I did?

Maybe, but I looked it up for a 2×3 ANOVA plot, so I know it handles the x positions of the different groups properly.
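For anyone following this thread, here is a hedged sketch of how position_jitterdodge() might be combined with dodged bars in a 2×3 layout. The data are entirely hypothetical (random normal values, 20 observations per cell), and the names `raw`, `cell_means`, and `p_factorial` are made up for illustration. The dodge width is set to the same value for the bars and the points so that the points line up with their bars.

```r
library(ggplot2)

# hypothetical 2 x 3 design: 2 conditions x 3 groups, 20 observations per cell
set.seed(1)
raw <- expand.grid(obs = 1:20,
                   condition = c("condition_A", "condition_B"),
                   group = c("group_1", "group_2", "group_3"))
raw$dv <- rnorm(nrow(raw), mean = 200, sd = 30)

# cell means for the bars (the result has columns condition, group, dv)
cell_means <- aggregate(dv ~ condition + group, data = raw, FUN = mean)

p_factorial <- ggplot() +
  # dodged bars: one bar per group within each condition
  geom_col(data = cell_means,
           aes(x = condition, y = dv, fill = group),
           position = position_dodge(width = 0.9)) +
  # raw points, dodged by the same fill grouping and jittered within each bar
  geom_point(data = raw,
             aes(x = condition, y = dv, fill = group),
             shape = 21, alpha = 0.4,
             position = position_jitterdodge(jitter.width = 0.2,
                                             dodge.width = 0.9))

# view the plot
p_factorial
```

The jitter width still has to be chosen by hand here; what position_jitterdodge() handles is shifting each group's points to sit over the matching dodged bar.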