# Solution to #BarBarPlots in R

I came across an interesting project the other day which is calling for a reconsideration of the use of bar plots (#barbarplots), with the lovely tag-line “Friends don’t let friends make bar plots!”. The project elegantly outlines convincing reasons why bar plots can be misleading, and have successfully funded a campaign to “…increase awareness of the limitations that bar plots have and the need for clear and complete data visualization”.

In this post, I want to show the limitations of bar plots that these scientists have highlighted. Then, I provide a solution to these limitations for researchers who want to continue using bar plots that can easily be cobbled together using R-statistics (with the ggplot2 package).

## The Data

Say you are a researcher who collects some data (it doesn’t matter on what) from two independent groups and you are interested in whether there is a difference between them. Most researchers would maybe calculate the mean and standard error of each group to describe the data. Then the researcher might plot the data using a bar plot, together with error bars representing the standard error. To provide an inferential test on whether a difference exists, the researcher would usually conduct an independent samples t-test.

Let’s provide some example data for two conditions:

• condition A (n = 100): mean of 200.17, a median of 196.43, and a standard error of 6.12
• condition B (n = 100): mean of 200.11, a median of 197.87, and a standard error of 7.19

Here is the bar plot:

Pretty similar, right? The researcher sees that there is little evidence for a difference; to test this inferentially they conduct an independent samples t-test, with the outcome t(198) = 0.007, p = .995, Cohen’s d < 0.001. The researcher concludes there is no difference between the two groups.

## The Problem

The problem raised by the #barbarplot campaign is that bar plots are a poor summary of the distribution of data. The bar plot above suggests there is no difference between the two groups, but the two groups are different! How do I know they are different? I simulated the data. What the bar plot hides is the shape of the underlying distribution of each data set. Below I present a density plot (basically a smoothed histogram) of the same data as above:

Now we can see that the two groups are clearly different! Condition A is a normal distribution, but condition B is bi-modal. The bar plot doesn’t capture this difference.

## The Solution

Density plots are a nice solution to presenting the distribution of data, but can get really messy when there are multiple conditions (imagine the above density plot but with 4 or more overlapping conditions). Plus, researchers are used to looking at bar plots, so there is something to be said about continuing their use (especially for factorial designs). But how do we get around the problem highlighted by the #barbarplot campaign?

One solution is to plot the bar plots as usual, but to overlay the bar plot with individual data points. Doing this allows the reader to see the estimates of central tendency (i.e., to interpret the bar plot as usual), whilst at the same time allowing the reader to see the spread of data in each condition. This sounds tricky to do (and it probably is if you are still using Excel; yes, I’m talking to you!), but it’s simple if you’re using R.

Below is the above data plotted as a combined bar and point plot. As you can see, the difference in distribution is now immediately apparent, whilst retaining the advantages of a familiar bar plot. Everyone wins!

## R Code

Below is the R code for the combined plot. This includes some code that generates the artificial data used in this example.

```#------------------------------------------------------------------------------
library(ggplot2)
library(dplyr)

#--- Generate artificial data

# set random seed so example is reproducible
set.seed(100)

# generate condition A
condition <- rep("condition_A", 100)
dv_A <- rnorm(100, 200, 60)
condition_A <- data.frame(condition, dv = dv_A)

# generate condition B
condition <- rep("condition_B", 100)
dv_B <- c(rnorm(50, 130, 10), rnorm(50, 270, 10))
condition_B <- data.frame(condition, dv = dv_B)

# put all in one data frame
raw_data <- rbind(condition_A, condition_B)

# calculate sumary statistics
data_summary <- raw_data %>%
group_by(condition) %>%
summarise(mean = mean(dv),
median = median(dv),
se = (sd(dv)) / sqrt(length(dv)))
#------------------------------------------------------------------------------

#------------------------------------------------------------------------------
#--- Do the "combined" bar plot
p2 <- ggplot()

# first draw the bar plot
p2 <- p2 + geom_bar(data = data_summary,
aes(y = mean,x = condition,
ymin = mean - se,
ymax = mean + se), fill = "darkgrey",
stat="identity", width=0.4)

# draw the error bars on the plot
p2 <- p2 + geom_errorbar(data = data_summary,
aes(y = mean, x = condition,
ymin = mean - se,
ymax = mean + se), stat = "identity",
width = 0.1, size = 1)

# now draw the points on the plot
p2 <- p2 + geom_point(data = raw_data, aes(y = dv, x = condition),
size = 3, alpha = 0.3,
position = position_jitter(width = 0.3, height = 0.1))

# scale and rename the axes, and make font size a bit bigger
p2 <- p2 + coord_cartesian(ylim = c(50, 400))
p2 <- p2 + scale_x_discrete(name = "Condition") +
scale_y_continuous(name = "DV")

p2 <- p2 + theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 14,face = "bold"))

# view the plot
p2
#------------------------------------------------------------------------------
```