DSCI_531_viz-1

Lecture 1 Worksheet

In this worksheet, we’ll be exploring various plot types (i.e., geometric objects), only using the x and y aesthetics (and group).

We’ll be jumping straight into the ggplot() function, instead of the more limited qplot() function.

  1. To get started, load the tidyverse and gapminder R packages.
suppressPackageStartupMessages(library(tidyverse))
library(gapminder)

Scatterplot

Let’s look at a scatterplot of gdpPercap vs. lifeExp.

  1. Fill out the grammar components below. Again, bold must be specified to make a ggplot2 plot.
    • We’ll ignore “coordinate system” and “facetting” after this.
| Grammar Component | Specification |
|---|---|
| **data** | gapminder |
| **aesthetic mapping** | x=lifeExp and y=gdpPercap |
| **geometric object** | point |
| scale | linear |
| statistical transform | none |
| coordinate system | rectangular/cartesian |
| facetting | none |
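  2. Populate the data and aesthetic mapping in ggplot. What is returned? What’s missing?
# Data and aesthetic mapping only: this draws empty axes, because no geom layer has been added yet.
ggplot(data=gapminder, mapping=aes(x=lifeExp, y=gdpPercap))

  3. Add the missing component as a layer.
# Argument names can be omitted: data and mapping are the first two arguments of ggplot().
ggplot(gapminder, aes(lifeExp, gdpPercap)) +
    geom_point()

# Equivalent: the mapping can be given to the geom layer instead.
ggplot(gapminder) +
    geom_point(aes(x=lifeExp, y=gdpPercap))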

Notice the “metaprogramming” again!

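  4. You must remember to put the aesthetic mappings in the aes function! What happens if you forget?
# Left commented out because it fails: without aes(), R looks for lifeExp and gdpPercap
# as ordinary objects and stops with "object 'lifeExp' not found".
#ggplot(gapminder) +
#    geom_point(x = lifeExp, y = gdpPercap)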
  5. Put the y-axis on a log scale, first by transforming the y variable.
    • Note: ggplot2 does some data wrangling and computations itself! We don’t always have to modify the data frame.
# log() is the natural log; use log10() to get the same base as scale_y_log10().
ggplot(gapminder, aes(lifeExp, log(gdpPercap))) +
    geom_point()

  6. Try again, this time by changing the scale (this way is better: the axis labels stay in the original units).
ggplot(gapminder, aes(lifeExp, gdpPercap)) +
    geom_point() +
    scale_y_log10()

  7. The aesthetic mappings can be specified on the geom layer if you want, instead of the main ggplot call. Give it a try:

  8. Optional: git stage and commit

Uses of a scatterplot: Visualize the relationship between two numeric variables.
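The scatterplot steps in this section can also be checked programmatically. The sketch below assumes ggplot2 and gapminder are installed and that `$layers` and `layer_data()` behave as in recent ggplot2 releases; the names `p_base`, `p_points`, and `built` are illustrative and not part of the worksheet.

```r
library(ggplot2)
library(gapminder)

# Data and aesthetic mapping only: a valid plot object, but with no layers to draw.
p_base <- ggplot(gapminder, aes(lifeExp, gdpPercap))
length(p_base$layers)

# Adding a geom supplies the missing component.
p_points <- p_base + geom_point()
length(p_points$layers)

# With scale_y_log10(), ggplot2 transforms y itself; the gapminder data frame is untouched.
built <- layer_data(p_points + scale_y_log10())
all.equal(built$y, log10(gapminder$gdpPercap))
```

Comparing the built data with `log10(gdpPercap)` shows that changing the scale and transforming the variable plot the same positions; only the axis labelling differs.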

Histograms and Kernel Density Plots

Let’s build a histogram of life expectancy.

  1. Fill out the grammar components below. Again, bold must be specified to make a ggplot2 plot.
| Grammar Component | Specification |
|---|---|
| **data** | gapminder |
| **aesthetic mapping** | x=lifeExp, y=count (corrected from before) |
| **geometric object** | histogram |
| scale | x and y both linear |
| statistical transform | count |
  2. Build the histogram of life expectancy.
ggplot(gapminder, aes(lifeExp)) +
    geom_histogram()

  3. Change the number of bins to 50.
ggplot(gapminder, aes(lifeExp)) +
    geom_histogram(bins=50)

  4. Instead of a histogram, let’s create a kernel density plot.
ggplot(gapminder, aes(lifeExp)) +
    geom_density()

  5. Optional: git stage and commit

Uses of a histogram: Explore the distribution of a single numeric variable.
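To see the "count" statistical transform from the grammar table at work, the binned data behind the histogram can be inspected. This is a minimal sketch assuming ggplot2 and gapminder are installed; `p_hist` and `binned` are illustrative names.

```r
library(ggplot2)
library(gapminder)

p_hist <- ggplot(gapminder, aes(lifeExp)) +
    geom_histogram(bins = 50)

# layer_data() returns the data after the statistical transform: one row per bin.
binned <- layer_data(p_hist)
head(binned[, c("xmin", "xmax", "count")])

# Every observation falls in exactly one bin, so the counts add up to the number of rows.
sum(binned$count) == nrow(gapminder)
```

The `count` column is what ends up on the y-axis, which is why only the x aesthetic has to be supplied.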

Box plots and violin plots

Let’s make box plots of GDP per capita for each continent. Note: the y-axis is much better on a log scale!

  1. Fill out the grammar components below. Again, bold must be specified to make a ggplot2 plot.
| Grammar Component | Specification |
|---|---|
| **data** | gapminder |
| **aesthetic mapping** | x=continent, y=gdpPercap |
| **geometric object** | boxplot OR violin |
| scale | log-y; x is discrete |
| statistical transform | boxplot: 5 number summary; violin plot: density estimate |
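  2. Initiate the ggplot call, with the log y scale, and store it in the variable a. Print out a.
a <- ggplot(gapminder, aes(continent, gdpPercap)) +
    scale_y_log10()
a  # prints empty axes: no geom layer yet
  3. Add the boxplot geom to a.
a + geom_boxplot()

# For comparison: the raw points behind the box plots, made semi-transparent.
a + geom_point(alpha=0.2)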

  4. A violin plot is a kernel density on its side, made symmetric. Add that geom to a.
    • What’s better here, boxplots or violin plots? Why?
a + geom_violin()

# The same kind of violin plot for life expectancy, on a linear y scale.
ggplot(gapminder, aes(continent, lifeExp)) +
    geom_violin()

  5. Optional: git stage and commit

Use of boxplot: Visualize 1-dimensional distributions (of a single numeric variable).
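The summary statistics that the boxplot stat computes can be pulled out of the plot and compared with a manual calculation. This is a sketch assuming ggplot2 and gapminder are installed; `box_stats` and `manual_medians` are illustrative names.

```r
library(ggplot2)
library(gapminder)

a <- ggplot(gapminder, aes(continent, gdpPercap)) +
    scale_y_log10()

# Summary statistics computed by the boxplot stat, one row per continent.
box_stats <- layer_data(a + geom_boxplot())
box_stats[, c("ymin", "lower", "middle", "upper", "ymax")]

# The scale is applied before the stat, so the medians are medians of log10(gdpPercap).
manual_medians <- tapply(log10(gapminder$gdpPercap), gapminder$continent, median)
all.equal(as.numeric(box_stats$middle), as.numeric(manual_medians))
```

Note that `ymin` and `ymax` are the whisker ends rather than the raw minimum and maximum; points beyond the whiskers are drawn separately as outliers.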

Jitter plots

Let’s work up to the concept of a jitter plot. Let’s explore the population for each continent using points (again, with the y-axis on a log scale).

Let’s hold off on identifying the grammar.

  1. Initiate the ggplot call to make a scatterplot of pop vs. continent; initiate the log y scale. Store the call in the variable b.
b <- ggplot(gapminder, aes(continent, pop)) +
    scale_y_log10()
  2. Add the point geom to b. Why is this an ineffective plot?
b + geom_point()

  3. A solution is to jitter the points. Add the jitter geom. Re-run the command a few times – does the plot change? Why?
b + geom_jitter()

  4. How does the grammar differ from a box plot or violin plot?
    • ANSWER:
  5. We can add multiple geom layers to our plot. Put a jitterplot overtop of the violin plot, starting with our base b. Try vice-versa.
b + geom_violin() + geom_jitter(alpha=0.1)

  6. Optional: git stage and commit

Uses of jitterplot: Visualize 1-dimensional distributions, AND get a sense of the sample size.
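The jitter is random noise, which is why the plot changes between runs. If a reproducible figure is needed, one option is to fix the seed through `position_jitter()`. The sketch below assumes ggplot2 and gapminder are installed; the width, alpha, and seed values are arbitrary choices, and `b_jit`, `first`, and `second` are illustrative names.

```r
library(ggplot2)
library(gapminder)

b <- ggplot(gapminder, aes(continent, pop)) +
    scale_y_log10()

# A fixed seed makes the jitter reproducible; height = 0 leaves the y values unchanged.
b_jit <- b +
    geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1),
               alpha = 0.3)

# Building the plot twice gives the same horizontal offsets.
first  <- layer_data(b_jit)
second <- layer_data(b_jit)
identical(first$x, second$x)
```

Jittering only horizontally is a deliberate choice here: the x position within a continent carries no information, while the y position does.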