Visualization Tip: Never Be Random

See that person’s legs sticking out of that washing machine? That’s random. When it comes to data visualization, don’t be like that person. Don’t be random.
data-visualization
transparency
tips-and-tricks
Author: Matthew E. Vanaman
Affiliation: Department of Statistics and Data Sciences, University of Texas at Austin
Published: Friday, September 27, 2024
Modified: Wednesday, October 30, 2024

Random legs.

Analyzing data is hard. Analyzing another person’s data analysis is even harder. Almost as hard as saying that last sentence out loud.

One way to make life easier for your audience — whether it be a reader of your article, someone watching your presentation, or anyone consuming your analysis — is to show your data and results visually.

I often come across visualizations that stop short of being maximally useful because some aspect of the plot is random. By random, I don’t mean random variability, which you do want to see in a plot. By “random”, I mean unplanned: some part of the plot could have been formatted with intent but was instead left to the plotting software’s defaults or the idiosyncrasies of the dataset.

With these examples, I hope to persuade you, kind visitor to my blog, that no part of a plot should be random.

Example 1: Average Heights of Star Wars Species

Say I wanted to explore the average heights (in centimeters) of the different species seen in the Star Wars films (using the starwars dataset from the dplyr R package). I decide to visualize the mean heights for each species.

Behold my plot below! See if you can spot the randomness:

Code
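library(dplyr)   # starwars data, filter(), %>%
library(ggplot2)

starwars %>% 
  filter(!is.na(species), !is.na(height)) %>% 
  ggplot(aes(x = species, y = height)) +
  geom_point(stat = "summary", fun = "mean") + # one point per species: its mean height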
  labs(x = "Species", y = "Height in Centimeters") +
  coord_flip() +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = NA, color = NA),
    plot.background = element_rect(fill = NA, color = NA),
    legend.background = element_rect(fill = NA, color = NA),
    legend.box.background = element_rect(fill = NA, color = NA)
  )

The randomness is in the y-axis, which arranges the species alphabetically (this is the default of the ggplot() function). This kind of randomness is tricky to spot because alphabetical ordering is a pattern and therefore seems rational. But as the eye moves from the top to the bottom of the plot, it encounters randomness in the species heights — no fun indeed for our pattern-seeking eyeballs.

Behold: my plot again! This one orders the y-axis by height:

Code
starwars %>% 
  filter(!is.na(species), !is.na(height)) %>% 
  ggplot(aes(x = reorder(species, height), y = height)) + # reorder() sorts species by mean height
  geom_point(stat = "summary", fun = "mean") +
  labs(x = "Species", y = "Height in Centimeters") +
  coord_flip() +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = NA, color = NA),
    plot.background = element_rect(fill = NA, color = NA),
    legend.background = element_rect(fill = NA, color = NA),
    legend.box.background = element_rect(fill = NA, color = NA)
  )

Isn’t this much easier to understand? The tallest species is at the top and the shortest is at the bottom — much more intuitive! 😎
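To see what reorder() is doing behind the scenes, here is a minimal base R sketch. The toy heights are made up for illustration and are not taken from the starwars dataset.

```r
# Hypothetical toy data (not from starwars)
toy <- data.frame(
  species = c("Wookiee", "Ewok", "Human", "Gungan", "Human"),
  height  = c(228, 88, 170, 210, 190)
)

# Default: character values become factor levels in alphabetical order,
# which is the order ggplot2 uses for a discrete axis
default_levels <- levels(factor(toy$species))

# reorder() sorts the levels by a summary of a second variable
# (the mean by default), from smallest to largest
height_levels <- levels(reorder(toy$species, toy$height))

print(default_levels)  # "Ewok" "Gungan" "Human" "Wookiee"
print(height_levels)   # "Ewok" "Human" "Gungan" "Wookiee"
```

The only thing that changes is the order of the factor levels, and that is all the axis needs in order to tell a story.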

Example 2: Gender Bias In College Admissions

Say I am analyzing data that some have claimed contains evidence of gender bias in college admissions (using the UCBAdmissions dataset, which has counts of admitted and rejected applicants by gender for the six largest departments at the University of California, Berkeley in 1973).

The first step in providing evidence of bias is to show that the rates of admission are indeed different for women compared to men. If the rates are the same, bias is ruled out. If there are differences, bias is possible.

At this stage, the audience just wants to know which departments have different rates of admission for men and women. Behold: my plot again! Can you spot the randomness?

Code
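library(tidyr)   # pivot_wider(), pivot_longer()
library(scales)  # percent_format()

# Note: `colors` is not defined in the code shown here. It is assumed to be a
# character vector of color values created earlier in the post's setup code.
plot_data <- UCBAdmissions %>%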
  as_tibble() %>%
  pivot_wider(names_from = Admit, values_from = n) %>% 
  mutate(
    Total = Admitted + Rejected,
    "Percent Admitted" = Admitted / Total * 100
    ) 

ggplot(plot_data, aes(x = Dept, y = `Percent Admitted`, fill = Gender)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  scale_fill_manual(values = colors[1:2]) +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = NA, color = NA),
    plot.background = element_rect(fill = NA, color = NA),
    legend.background = element_rect(fill = NA, color = NA),
    legend.box.background = element_rect(fill = NA, color = NA)
  )

There are two sources of randomness:

  1. The y-axis truncates at 80% instead of 100% admission rate (this value is automatically chosen by ggplot() based on the characteristics of the data); and
  2. This one’s tricky: the order of the departments on the x-axis.

As for (1), why should the axis be truncated at 80% and not 90% or 99% or 100%? Conventional wisdom tells us to truncate the axis based on “what looks nice”. But is our research question, “what looks nice?” Nope! It’s, “are any gender differences in admission rates big enough to make us worried admissions are biased?”

If the axis is truncated, the differences will be visually inflated, and the audience will be more likely to falsely conclude that a trivial difference in admissions rate is actually worrisome. If, on the other hand, the difference in admission rates looks huge even with the full axis, then Berkeley’s lawyers will feel spontaneous neck pain.

Here’s the plot with the full y-axis:

Code
ggplot(
  plot_data, 
  aes(x = Dept,
      y = `Percent Admitted`, 
      fill = Gender)
  ) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format(scale = 1), limits = c(0, 100)) +
  scale_fill_manual(values = colors[1:2]) +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = NA, color = NA),
    plot.background = element_rect(fill = NA, color = NA),
    legend.background = element_rect(fill = NA, color = NA),
    legend.box.background = element_rect(fill = NA, color = NA)
  )

Notice the differences between men and women are still present, but they appear smaller when considered against the full range of possible percentages. Some will disagree, but I strongly recommend against truncated axes for any bounded variables (like percents, which range from 0 to 100 in this case) except in rare, extreme cases where you can, and hopefully do, provide an explicit rationale for why the axis is truncated (see footnote 1).
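For a rough sense of how much the axis limit changes the visual impression, one option is to compute what share of the axis each gender gap occupies. This is only a sketch: it ignores the small padding ggplot2 adds around the axis, and the helper name gap_share is hypothetical.

```r
# Admission rates (percent) by Gender x Dept from the built-in UCBAdmissions table
counts <- unclass(UCBAdmissions)
rates  <- 100 * counts["Admitted", , ] /
  (counts["Admitted", , ] + counts["Rejected", , ])

# Absolute gap between women and men in each department, in percentage points
gap <- abs(rates["Female", ] - rates["Male", ])

# Hypothetical helper: fraction of the axis length that a gap covers
# on an axis running from 0 to `upper` (axis padding ignored for simplicity)
gap_share <- function(gap, upper) gap / upper

data_driven <- gap_share(gap, upper = max(rates))  # axis ends at the largest rate
full_range  <- gap_share(gap, upper = 100)         # axis ends at 100%

print(round(cbind(data_driven, full_range), 3))
```

Every gap takes up a larger share of the data-driven axis than of the full 0 to 100 axis, which is the visual inflation described above.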

Now let’s consider randomness part (2), namely the order of the departments on the x-axis. The departments are again ordered alphabetically as per ggplot() defaults. The issue is that nothing about our question of gender bias has anything to do with the alphabetical ordering of the departments. This axis lacks intent, dammit! It can’t keep getting away with it!

Remember, the whole point of plotting gender by department is to help the audience see which departments are admitting men at higher rates than women (as opposed to the other way around — I’ll tackle that in a moment). This is easier to see if the x-axis is arranged so that the first department admits the most men relative to women, followed by the department that admits the second-most men relative to women, and so on. That’s a pattern the audience will probably look for, and they will like you more if you serve it up to them on a silver platter.

Here’s a second attempt:

Code
# Do some data wrangling magic
plot_data <- plot_data %>% 
  select(Gender, `Percent Admitted`, Dept) %>% 
  pivot_wider(names_from = Gender, values_from = `Percent Admitted`) %>% 
  mutate(Gender_Difference = Female - Male) %>% 
  pivot_longer(cols = Male:Female, names_to = "Gender", values_to = "Percent Admitted") 

ggplot(
  plot_data, 
  aes(x = reorder(Dept, Gender_Difference), # order the x axis by the Gender_Difference variable
      y = `Percent Admitted`, 
      fill = Gender)
  ) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format(scale = 1), limits = c(0, 100)) +
  scale_fill_manual(values = colors[1:2]) +
  labs(x = "Dept") +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = NA, color = NA),
    plot.background = element_rect(fill = NA, color = NA),
    legend.background = element_rect(fill = NA, color = NA),
    legend.box.background = element_rect(fill = NA, color = NA)
  )

In this new plot, Department E now stands out clearly as having the largest disparity favoring men. The disparity shrinks as you move from left to right, until the pattern reverses. At the far right, Department A admits women at a much higher rate than men.

What if we tweaked the research question from “which departments admit men at higher rates than women?” to “which departments admit either gender at a higher rate than the other?” In that case, I’ll order the departments by the size of the difference in the rates regardless of which gender was admitted more often:

Code
ggplot(plot_data, 
       aes(
         x = reorder(Dept, desc(abs(Gender_Difference))), # change the order to the absolute value of gender difference, use descending order to put largest difference on the left
         y = `Percent Admitted`, 
         fill = Gender)
       ) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format(scale = 1), limits = c(0, 100)) +
  scale_fill_manual(values = colors[1:2]) +
  labs(x = "Dept") +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = NA, color = NA),
    plot.background = element_rect(fill = NA, color = NA),
    legend.background = element_rect(fill = NA, color = NA),
    legend.box.background = element_rect(fill = NA, color = NA)
  )

Now we can see that Department A has the largest difference in rates between genders (in favor of women), while Department B has the second largest difference (also in favor of women). The further right you move, the more parity there is between genders. Department F has the most gender parity.
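If you want to check the two orderings without drawing anything, here is a small base R sketch that computes the same quantities directly from the UCBAdmissions table. The variable names are my own and are not part of the plotting code above.

```r
# Admission rates (percent) by Gender x Dept from the built-in UCBAdmissions table
counts <- unclass(UCBAdmissions)
rates  <- 100 * counts["Admitted", , ] /
  (counts["Admitted", , ] + counts["Rejected", , ])

# Positive values favor women, negative values favor men
gender_difference <- rates["Female", ] - rates["Male", ]

# Ordering for "which departments favor men the most?" (most negative first)
signed_order <- names(sort(gender_difference))

# Ordering for "which departments have the largest gap in either direction?"
absolute_order <- names(sort(abs(gender_difference), decreasing = TRUE))

print(round(gender_difference, 1))
print(signed_order)    # "E" "C" "F" "D" "B" "A"
print(absolute_order)  # "A" "B" "E" "C" "D" "F"
```

The sign of each difference tells you the direction of the gap, so it is worth glancing at the numbers before describing a plot in words.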

Digression: Never Use Bar Charts

Surprise! There was a third random thing about this plot: the choice to use a bar chart! Bamboozled again! Won’t you ever learn?!

It’s not a popular idea, but bar charts are terrible. They misleadingly draw plot graphics (bars) where there is no data. In our case, the data are the department and the admission rate, and each department has just one rate for men and one for women.

But the bar covers all rates from 0 to the actual rate. This violates one of the few hard and fast rules of visualization: never visualize data that don’t exist!

Here’s one possible alternative you could use that more accurately draws the eye to the differences in rates by genders:

Code
ggplot(plot_data, 
       aes(
         x = reorder(Dept, 100 - abs(Gender_Difference)), # change the order to the absolute value of gender difference (subtract from 100 so that largest differences are first)
         y = `Percent Admitted`, 
         color = Gender,
         shape = Gender,
         group = Dept) # one connecting line per department
       ) +
  geom_line(color = "black", linetype = "dotted") +
  geom_point(size = 3) +
  scale_y_continuous(labels = percent_format(scale = 1), limits = c(0, 100)) +
  scale_color_manual(values = colors[1:2]) +
  labs(x = "Dept") + 
  coord_flip() +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = NA, color = NA),
    plot.background = element_rect(fill = NA, color = NA),
    legend.background = element_rect(fill = NA, color = NA),
    legend.box.background = element_rect(fill = NA, color = NA)
  )

See the dotted black lines? The longer they are, the larger the difference in admission rates for that department. If viewing the above plot elicits a pleasant massaging feeling in your occipital lobe, that’s normal!

Here’s another alternative:

Code
set.seed(350)
plot_data %>% 
  mutate(
    label = round(`Percent Admitted`, 1),
    label = format(label, nsmall = 1),
    label = paste(label, "%", sep = "")
         ) %>% 
  ggplot(aes(x = Gender, y = `Percent Admitted`, color = Dept)) +
  geom_point(size = 3, shape = "circle", show.legend = FALSE) +
  geom_line(aes(group = Dept)) + # one connecting line per department
  geom_label_repel(aes(label = label), label.size = 0, show.legend = FALSE) + # from the ggrepel package
  scale_y_continuous(labels = percent_format(scale = 1), limits = c(0, 100)) +
  scale_color_manual(values = colors) +
  labs(color = "Dept") +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = NA, color = NA),
    plot.background = element_rect(fill = NA, color = NA),
    legend.background = element_rect(fill = NA, color = NA),
    legend.box.background = element_rect(fill = NA, color = NA)
  )

This one is nice because you don’t have to order the departments to see where the biggest differences are — they are biggest when the slopes are steepest. Credit to this post by Dr. Rebecca Barter for the idea!

Conclusion

Don’t be random! Plot your data with intent. Give yourself credit for taking pity on your audience by using a visualization, but don’t rest on your laurels. Look at your plot, think about what you want your audience to learn from it, and interrogate the defaults of your plotting software for randomness. Replace it with intent, and bask in the reflected glory of being an effective data visualizer.

Footnotes

  1. Note that the advice against truncation doesn’t apply to unbounded variables where the minimum and/or maximum values are determined by the data. Sometimes you do want to truncate: the rule is not “never truncate”, it’s just “always format your axes with intent”. As an example, let’s turn back to the starwars dataset. The heaviest species (Hutt) has an average weight of 1,358 kilograms, while the next heaviest species (Kaleesh) is less than 250 kilograms. Because Hutts are so unusually massive, including them might mislead the viewer into thinking that differences among the other species are smaller than they actually are. In this case, we actually should truncate the axis or, better yet, exclude Hutts from the plot altogether:

    Code
    untruncated <- starwars %>% 
      filter(!is.na(species), !is.na(mass)) %>% 
      ggplot(aes(x = reorder(species, mass), y = mass)) + 
      geom_point(stat = "summary") + 
      labs(
        x = "Species",
        y = "Weight in Kilograms",
        title = "Untruncated X-Axis"
        ) + 
      coord_flip() + # flip x and y axes
      theme_minimal() + 
      theme(
        panel.background = element_rect(fill = NA, color = NA),
        plot.background = element_rect(fill = NA, color = NA),
        legend.background = element_rect(fill = NA, color = NA),
        legend.box.background = element_rect(fill = NA, color = NA),
        plot.title = element_text(hjust = 0)
      )
    truncated <- starwars %>% 
      filter(
        !is.na(species), 
        !is.na(mass), 
        species != "Hutt" # exclude hutts
        ) %>% 
      ggplot(aes(x = reorder(species, mass), y = mass)) + 
      geom_point(stat = "summary") + 
      labs(
        x = "Species",
        y = "Weight in Kilograms",
        title = "Axis Truncated By Excluding Hutts",
        caption = "Note: Hutt species was much heavier than\nall other species at 1,358 kg. To make \ndifferences among other species clearer, \nHutts are excluded from the plot."
        ) + 
      coord_flip() + # flip x and y axes
      theme_minimal() + 
      theme(
        panel.background = element_rect(fill = NA, color = NA),
        plot.background = element_rect(fill = NA, color = NA),
        legend.background = element_rect(fill = NA, color = NA),
        legend.box.background = element_rect(fill = NA, color = NA),
        plot.caption = element_text(hjust = 0),
        plot.title = element_text(hjust = 1)
      )
    # Combine the two plots side by side (requires the patchwork package)
    untruncated + truncated +
      plot_annotation(
        theme = theme(
          panel.background = element_rect(fill = NA, color = NA),
          plot.background = element_rect(fill = NA, color = NA)
        ))


Citation

BibTeX citation:
@online{e._vanaman2024,
  author = {E. Vanaman, Matthew},
  title = {Visualization {Tip:} {Never} {Be} {Random}},
  date = {2024-09-27},
  url = {https://www.matthewvanaman.com/blog/never-be-random/},
  langid = {en}
}
For attribution, please cite this work as:
E. Vanaman, M. (2024, September 27). Visualization Tip: Never Be Random.