Making effective plots can tell you a LOT about data. It’s hard! It’s an underrated but very powerful skill to develop.
Tips for effective graphing
Disclaimer: The tips you see here and online hold true for most cases. There might be some rare cases where the tips don’t hold – the key is to be intentional about every component of the graph.
library(tidyverse)
library(gapminder)
gapxy <- ggplot(gapminder, aes(lifeExp, gdpPercap))
gapxy + geom_point()
gapxy <- gapxy + scale_y_log10()
gapxy + geom_point()
gapxy + geom_point(alpha=0.2)
gapxy + geom_hex()
gapxy + geom_density2d()
gapxy + facet_wrap(~continent) + geom_point(alpha=0.2)
ggplot(gapminder, aes(continent, lifeExp)) +
geom_violin(fill="red", alpha=0.2) +
geom_boxplot(fill="blue", alpha=0.2)
Display just the right amount of content: not too much, not too little.
In particular: reveal as much relevant information as possible; trim irrelevant and redundant information.
Reveal the data, because hiding it is not an effective way to convey information!
Only use as much data as is required for answering a data analytic question.
See Di Cook’s example in the “reduce complexity” section of Rookie Mistakes.
Remove the special effects (sorry, Excel). Great demo: Darkhorse Analytics’ “Less is more” gif.
More examples of extraneous information:
map_data("france") %>%
ggplot(aes(long, lat)) +
geom_polygon(aes(group=group), fill=NA, colour="black") +
theme_bw() +
ggtitle("Are lat and long really needed?")
ggplot(gapminder, aes(year, lifeExp)) +
geom_line(aes(group=country, colour=country), alpha=0.2) +
guides(colour="none") +
theme_bw() +
ggtitle("Is colouring by country really necessary here?\nNever mind fitting the legend!")
Don’t redundantly map variables to aesthetics/facets.
HairEyeColor %>%
as_tibble() %>%
uncount(n) %>%
ggplot(aes(Hair)) +
facet_wrap(~Sex) +
geom_bar(aes(fill=Sex)) +
theme_bw() +
ggtitle("Don't do this.")
Really want to use colour? No problem, colours are fun! Try this:
HairEyeColor %>%
as_tibble() %>%
uncount(n) %>%
ggplot(aes(Hair)) +
facet_wrap(~Sex) +
geom_bar(fill="#D95F02") +
theme_bw() +
ggtitle("Do this.")
HairEyeColor %>%
as_tibble() %>%
uncount(n) %>%
count(Hair) %>%
ggplot(aes(Hair, n)) +
geom_col() +
geom_text(aes(label=n), vjust=-0.1) +
theme_bw() +
labs(x="Hair colour", y="count",
title="Are the bar numbers AND y-axis really needed?")
Here’s an iconic fail in Henrik Lindberg’s tweet: the “depeche plot”.
Don’t use pie charts – use bar charts instead. 3 reasons why they suck.
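As a rough illustration of the bar-chart alternative, here is a minimal sketch that shows the share of each hair colour in the built-in HairEyeColor data as bars rather than pie slices. The object name `hair_props` and the choice to plot proportions are assumptions made for this example, not part of the original notes.

```r
library(tidyverse)

# Tally people by hair colour (weighting by the cell counts in n),
# then convert the counts to proportions of the whole.
hair_props <- HairEyeColor %>%
  as_tibble() %>%
  count(Hair, wt = n) %>%
  mutate(prop = n / sum(n))

# One bar per category: lengths on a common axis are easier to compare
# than the angles of pie slices.
ggplot(hair_props, aes(Hair, prop)) +
  geom_col() +
  theme_bw() +
  labs(x = "Hair colour", y = "Proportion")
```

Bars are appropriate here because proportions are meaningful relative to zero.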
To bar, or not to bar? Not if zero doesn’t matter! As a general rule, I like to err on the side of points over bars.
plot_beav2 <- bind_rows(
mutate(beaver1, beaver = "Beaver 1"),
mutate(beaver2, beaver = "Beaver 2")
) %>%
group_by(beaver) %>%
summarize(med = median(temp)) %>%
ggplot(aes(beaver, med)) +
theme_bw() +
xlab("") +
ylab("Body Temperature\n(Celsius)")
plot_beav2 +
geom_col() +
ggtitle("Don't do this.")
plot_beav2 +
geom_point() +
ggtitle("Do this.")
(Yes, that’s really all the info you’re conveying. Own it.)
plot_iris <- ggplot(iris, aes(Sepal.Width, Sepal.Length)) +
geom_jitter(aes(colour=Species)) +
theme_bw() +
theme(legend.position = "bottom")
plot_iris +
scale_colour_manual(values=c("brown", "gray", "yellow")) +
ggtitle("Don't do this.")
plot_iris +
scale_colour_brewer(palette="Dark2") +
ggtitle("Leave it to an expert.\nDo this.")
Are you comparing data across groups? Consider what a meaningful distance measure might be between two groups.
Are differences meaningful, and proportions not? Example: temperature. Zero doesn’t matter.
plot_beav <- bind_rows(
mutate(beaver1, beaver = "Beaver 1"),
mutate(beaver2, beaver = "Beaver 2")
) %>%
ggplot(aes(beaver, temp)) +
geom_violin() +
geom_jitter(alpha=0.25) +
theme_bw() +
xlab("") +
ylab("Body Temperature\n(Celsius)")
plot_beav +
ggtitle("This.")
plot_beav +
ylim(c(0,NA)) +
ggtitle("Not This.")
Are proportions meaningful, and differences not? Example: counts.
HairEyeColor %>%
as_tibble() %>%
uncount(n) %>%
ggplot(aes(Hair)) +
geom_bar() +
theme_bw() +
ggtitle("Keep this starting from 0.")
Want to convey absolute life expectancies, in addition to relative life expectancies? Show 0.
ggplot(gapminder, aes(continent, lifeExp)) +
geom_boxplot() +
ylim(c(0, NA)) +
geom_hline(yintercept = 0,
linetype = "dashed")
Order categories meaningfully: it’s easier to see rankings. See this STAT 545 example by Jenny Bryan. Use forcats to reorder factor levels.
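A minimal sketch of reordering with forcats, assuming we want continents ranked by median life expectancy. The names `gap_ordered` and `plot_ordered` are made up for this example; the linked STAT 545 example may order things differently.

```r
library(tidyverse)  # loads ggplot2, dplyr and forcats
library(gapminder)

# Reorder the continent factor by life expectancy.
# fct_reorder() summarises with the median by default.
gap_ordered <- gapminder %>%
  mutate(continent = fct_reorder(continent, lifeExp))

# The x-axis now follows the ranking instead of alphabetical order.
plot_ordered <- ggplot(gap_ordered, aes(continent, lifeExp)) +
  geom_boxplot() +
  theme_bw() +
  labs(x = "", y = "Life expectancy (years)")

plot_ordered
```

To rank from highest to lowest instead, `fct_reorder()` accepts `.desc = TRUE`.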