as a tool
suppressPackageStartupMessages(library(tidyverse))
knitr::opts_chunk$set(fig.width=5, fig.height=3)
From this lecture, students are expected to be able to:
What is data science?
Gaussian demo: what does an independent bivariate Gaussian look like? Let’s try different means and sd’s.
n <- 100
x <- rnorm(n)
y <- rnorm(n)
qplot(x, y)
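To try different means and sd's, one option is to pass `mean` and `sd` to `rnorm()` separately for each variable. A minimal sketch follows; the parameter values and the seed are illustrative assumptions, and `ggplot()` is used in place of `qplot()` because `qplot()` is deprecated in recent ggplot2 releases.

```r
library(ggplot2)

set.seed(1)  # assumed seed, only so the sketch is reproducible
n <- 100

# x and y are drawn separately, so they are independent by construction.
x <- rnorm(n, mean = 2, sd = 0.5)   # illustrative mean and sd
y <- rnorm(n, mean = -1, sd = 3)    # illustrative mean and sd

# Changing the means shifts the cloud; changing the sd's stretches it
# along an axis. Neither tilts it, because there is no dependence.
ggplot(data.frame(x = x, y = y), aes(x, y)) +
  geom_point()
```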
Generic independence:
n <- 100
# x <-
# y <-
qplot(x, y)
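One possible way to fill in the blanks above, as a sketch only: independence does not require Gaussian variables, so any two separately generated samples will do. The choice of an Exponential and a Uniform distribution here is an assumption made for illustration.

```r
library(ggplot2)

set.seed(2)  # assumed seed, for reproducibility only
n <- 100

# Two non-Gaussian variables, generated separately (hence independent).
x <- rexp(n, rate = 1)           # illustrative choice: Exponential(1)
y <- runif(n, min = 0, max = 1)  # illustrative choice: Uniform(0, 1)

# The cloud no longer looks elliptical, but knowing x still says nothing about y.
ggplot(data.frame(x = x, y = y), aes(x, y)) +
  geom_point()
```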
Practicalities of independence: x and y do not inform each other.
How do you answer the question “how much dependence is there?”
n <-
# x <-
# y <-
qplot(x, y)
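A sketch of one way to complete the dependence example: build `y` from `x` plus noise, then summarize the strength of dependence with a single number. Kendall's tau is used here as one possible summary; the construction of `y`, the noise level, and the seed are all assumptions.

```r
library(ggplot2)

set.seed(3)  # assumed seed, for reproducibility only
n <- 100

x <- rnorm(n)
y <- x + rnorm(n, sd = 0.5)  # y depends on x by construction (illustrative)

# One possible answer to "how much dependence is there?":
# Kendall's tau lies in [-1, 1], with 0 expected under independence.
tau <- cor(x, y, method = "kendall")
tau

ggplot(data.frame(x = x, y = y), aes(x, y)) +
  geom_point()
```

Shrinking the noise `sd` should push `tau` toward 1, and increasing it should push `tau` toward 0.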
Practicalities of dependence: each variable informs the other. Useful if we know something about one, and want to know something about the other (this is a HUGE part of data science!)
Normal scores plots are a useful way to standardize the viewing of dependence between two variables.
Function for Step 1:
nscore <- function(x) qnorm(ecdf(x)(x))
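One thing to watch with this definition: the empirical cdf equals 1 at the largest observation, and `qnorm(1)` is `Inf`. The sketch below shows this on a small vector and gives a rescaled variant that keeps every score finite. The name `nscore_finite` and the `n / (n + 1)` rescaling are assumptions added here, not part of the lecture's function.

```r
# As defined in the notes.
nscore <- function(x) qnorm(ecdf(x)(x))

# Hypothetical variant: multiply the ECDF values by n / (n + 1)
# so that none of them equals 1 before applying qnorm().
nscore_finite <- function(x) {
  n <- length(x)
  qnorm(ecdf(x)(x) * n / (n + 1))
}

x <- c(0.3, 1.2, 2.5, 4.1)  # small illustrative vector
nscore(x)         # the score of the largest value is Inf
nscore_finite(x)  # every score is finite
```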
Convert the following scatterplot to a normal scores plot. Is there dependence?
n <- 1000
x <- rexp(n)
y <- rexp(n)
qplot(x, y)
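A sketch of the conversion, assuming the `nscore()` function from above: apply it to each variable separately and plot the scores. The seed and the step that drops non-finite scores (the largest observation in each variable) are assumptions made for this illustration.

```r
library(ggplot2)

nscore <- function(x) qnorm(ecdf(x)(x))  # as defined in the notes

set.seed(4)  # assumed seed, for reproducibility only
n <- 1000
x <- rexp(n)
y <- rexp(n)

# Transform each margin to normal scores.
scores <- data.frame(x = nscore(x), y = nscore(y))

# The maximum of each variable maps to Inf; drop those rows for plotting.
scores <- scores[is.finite(scores$x) & is.finite(scores$y), ]

# With independent x and y, the scores should form a round, untilted cloud.
ggplot(scores, aes(x, y)) +
  geom_point(alpha = 1/3) +
  labs(x = "Normal score of x", y = "Normal score of y")
```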
Try making a normal scores plot of previous examples OR data examples:
EDA is not about fishing for meaning. Ask yourself this question: what does this graph show that we can’t see from viewing the raw data set?
ggplot(mtcars, aes(disp, carb)) +
  geom_point(alpha=1/3) +
  labs(x = "Displacement",
       y = "Number of carburetors")
Nothing going on in the above plot. End of story.
mtcars %>%
mutate(cyl = paste0(cyl, "-cylinder")) %>%
ggplot(aes(wt, hp)) +
geom_point() +
facet_wrap(~ cyl)
Nothing going on within each panel above. Though, cars with more cylinders tend to have higher hp.
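The between-panel pattern can be checked numerically with a grouped summary. A minimal sketch, assuming the median is a reasonable summary of hp within each cylinder group:

```r
library(dplyr)

# Median horsepower and group size for each cylinder count.
hp_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(median_hp = median(hp), n = n())

hp_by_cyl
```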
ggplot(iris, aes(Sepal.Width, Petal.Width)) +
  geom_jitter()
Data fall into at least two groups. Explore further:
ggplot(iris, aes(Sepal.Width, Petal.Width)) +
  geom_jitter(aes(colour = Species))
Data fall into three groups! Is there dependence within each group? Check against normal scores plot:
ggplot(iris, aes(nscore(Sepal.Width), nscore(Petal.Width))) +
geom_jitter() +
facet_wrap(~ Species, scales="free")
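A caveat on the plot above: as far as we understand, expressions inside `aes()` are evaluated on the whole data frame before it is split into panels, so the normal scores there are computed from the pooled data rather than within each species. A sketch of a within-species version follows; the column names `ns_sepal` and `ns_petal` are hypothetical, and dropping the non-finite scores before plotting is an assumption.

```r
library(dplyr)
library(ggplot2)

nscore <- function(x) qnorm(ecdf(x)(x))  # as defined in the notes

# Compute normal scores separately within each species.
iris_scores <- iris %>%
  group_by(Species) %>%
  mutate(ns_sepal = nscore(Sepal.Width),   # hypothetical column name
         ns_petal = nscore(Petal.Width)) %>%  # hypothetical column name
  ungroup()

# The largest value(s) in each group map to Inf; drop them for plotting.
iris_scores %>%
  filter(is.finite(ns_sepal), is.finite(ns_petal)) %>%
  ggplot(aes(ns_sepal, ns_petal)) +
  geom_jitter() +
  facet_wrap(~ Species)
```

Because each panel now has standard-normal-like margins of its own, any tilt in a panel reflects dependence within that species rather than differences between species.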
Not much. Maybe some positive dependence in versicolor.
Let’s fill out the worksheet on the remaining ggplot2