Due Tuesday, Nov 6 at 18:00 (although you should be able to finish in lecture)

```
suppressPackageStartupMessages(library(tidyverse))
```

In this course, we’ve looked at four layers of data vis:

- Drawing conclusions from visualizations
  - Guiding question: what can we see in this graph that we can’t see by looking at a table of the raw data?

- Using tools to create a graph
  - `ggplot2` and other niche R packages; also `matplotlib`, `pandas`, and `seaborn` in Python.

- Components of a graph: the grammar of graphics, and theme elements.
- Effective graph choice

In this “half lab” assignment, you will work in a team of 2-3 to create a graph that touches on all of these components, by choosing one of five analysis scenarios. Your submission will include the following parts:

- A graph.
- A figure caption.
- A description of your design choice.

The following section outlines the evaluation of this assignment. The scenarios follow.

rubric={mechanics:5}

To get the marks for tidy submission:

- **INDICATE WHO YOUR HACKATHON TEAM MEMBERS ARE**! (Both names and github.ubc.ca usernames)
- This document only contains instructions. Where you put your work is up to you, but it’s important for you to make it obvious where your work can be found!
- Use either jupyter notebook or R Markdown.
- Use either R or Python.
- Be sure to follow the general lab instructions.
- Do not include any code that installs packages (this is not good practice anyway).

rubric={writing:5}

To get these marks, you must use proper English, spelling, and grammar in your submission.

rubric={vis:10}

To get these marks, your graph must be publication quality.

rubric={reasoning:40}

To get these marks, briefly describe the decisions you made in designing this graph to be effective. Don’t write a lot. Only focus on the big picture things here. Point form is fine.

rubric={accuracy:15, quality:15}

To get these marks, your code must work (`accuracy` rubric) and must be high quality (`quality` rubric: readable and reasonably efficient).

To get these marks, you should accompany your graph with a figure caption that:

- Orients the reader to the graph (like a graph title would).
  - Example: “Relationship between GDP per capita and life expectancy”

- Fills readers in with critical information that the graph could not convey. This might not be applicable, but it probably will be.
  - Example: “taken every 5 years between 1952 and 2007”

- Draws a conclusion from the graph.
  - Example: “countries with higher GDP per capita tend to have a higher life expectancy.”

Every day, you run a model that predicts tomorrow’s river discharge of the Bow River at Banff. On some days, the model produces an invalid forecast (this makes sense in the context of probabilistic forecasts, something you’ll learn in DSCI 562). You suspect that invalid forecasts (called `error` in the data) are more likely to happen when the actual discharge that materializes the next day (called `outcome` in the data) is larger. Your task is to produce a visual to explore this, while still giving the viewer a sense of the overall chance of an invalid forecast.

```
(sA <- read_csv("data/sA-error_prop.csv"))
```

```
## Parsed with column specification:
## cols(
##   outcome = col_double(),
##   error = col_logical()
## )
## # A tibble: 546 x 2
##    outcome error
##      <dbl> <lgl>
##  1    9.96 FALSE
##  2    9.98 FALSE
##  3   10.3  FALSE
##  4   10.3  FALSE
##  5   10.3  FALSE
##  6   10.4  FALSE
##  7   10.5  FALSE
##  8   10.5  FALSE
##  9   10.6  FALSE
## 10   10.6  FALSE
## # ... with 536 more rows
```
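One possible starting point, sketched here on simulated data rather than on `sA`, is to split `outcome` into bins and compare the proportion of invalid forecasts in each bin with the overall proportion. The data frame `toy`, the choice of five equal-count bins, and the simulated relationship are all assumptions made for illustration; the sketch uses base R only so that it runs without extra packages.

```
# Simulated stand-in for sA (the real data come from data/sA-error_prop.csv).
set.seed(1)
toy <- data.frame(outcome = sort(runif(500, 10, 200)))
toy$error <- runif(500) < plogis(-4 + 0.02 * toy$outcome)

# Overall proportion of invalid forecasts.
overall <- mean(toy$error)

# Split outcome into 5 equal-count bins (an arbitrary choice).
breaks <- quantile(toy$outcome, probs = seq(0, 1, length.out = 6))
toy$bin <- cut(toy$outcome, breaks = breaks, include.lowest = TRUE)

# Proportion of invalid forecasts, and number of observations, per bin.
by_bin <- aggregate(error ~ bin, data = toy, FUN = mean)
names(by_bin)[2] <- "prop_error"
by_bin$n <- as.vector(table(toy$bin))

by_bin
overall
```

A table like `by_bin` could then be drawn with the per-bin proportions next to a reference line at `overall`; binning is only one option, and a smoother over the raw 0/1 values is another.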

Suppose you have snowmelt (in millimeters) and river discharge (in m^3/s) data that are recorded daily, although some days have missing data. You’d like to show your audience *when* you have data on record, for both variables.

The data are given in two forms for your data wrangling convenience:

- In `sBa`, each row corresponds to an available observation of data type `Type` on date `date`.
- In `sBb`, each row corresponds to an (inclusive) range of dates where data are available.

```
(sBa <- read_csv("data/sB-time_intervals-pointwise.csv"))
```

```
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   Type = col_character()
## )
## # A tibble: 23,070 x 2
##    date       Type
##    <date>     <chr>
##  1 1980-01-01 Discharge
##  2 1980-01-02 Discharge
##  3 1980-01-03 Discharge
##  4 1980-01-04 Discharge
##  5 1980-01-05 Discharge
##  6 1980-01-06 Discharge
##  7 1980-01-07 Discharge
##  8 1980-01-08 Discharge
##  9 1980-01-09 Discharge
## 10 1980-01-10 Discharge
## # ... with 23,060 more rows
```

```
(sBb <- read_csv("data/sB-time_intervals-range.csv"))
```

```
## Parsed with column specification:
## cols(
##   beginning = col_date(format = ""),
##   ending = col_date(format = ""),
##   Type = col_character()
## )
## # A tibble: 89 x 3
##    beginning  ending     Type
##    <date>     <date>     <chr>
##  1 1980-01-01 2014-12-31 Discharge
##  2 1984-01-05 1984-02-15 Snowmelt
##  3 1984-02-28 1984-03-19 Snowmelt
##  4 1984-03-22 1984-06-15 Snowmelt
##  5 1984-06-17 1984-09-30 Snowmelt
##  6 1984-10-03 1984-10-10 Snowmelt
##  7 1984-10-13 1984-11-02 Snowmelt
##  8 1984-11-06 1984-11-21 Snowmelt
##  9 1984-11-24 1984-11-29 Snowmelt
## 10 1985-01-03 1985-01-31 Snowmelt
## # ... with 79 more rows
```
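To make the relationship between the two forms concrete, the sketch below expands two inclusive date ranges (copied from the `sBb` printout above) into one row per day, which is the shape of `sBa`. The names `ranges`, `expand_range`, and `pointwise` are hypothetical, and the sketch uses base R only; it is an illustration of how the forms relate, not a required step.

```
# Two inclusive ranges in the same shape as sBb.
ranges <- data.frame(
  beginning = as.Date(c("1984-01-05", "1984-02-28")),
  ending    = as.Date(c("1984-02-15", "1984-03-19")),
  Type      = "Snowmelt"
)

# Expand row i of `ranges` into one row per day, endpoints included.
expand_range <- function(i) {
  data.frame(
    date = seq(ranges$beginning[i], ranges$ending[i], by = "day"),
    Type = ranges$Type[i]
  )
}

# Stack the expanded ranges into a single pointwise data frame (like sBa).
pointwise <- do.call(rbind, lapply(seq_len(nrow(ranges)), expand_range))

nrow(pointwise)
range(pointwise$date)
```

The range form tends to suit geometries drawn between two endpoints, while the pointwise form suits one mark per day; which is more convenient depends on the graph you choose.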

Suppose you’re writing a report to justify a prediction model you’ve built, which predicts river discharge (in m^3/s) using snowmelt (in millimeters). Part of your investigation involves trying out different (increasing) weight functions on snowmelt (you’ll learn more about weight functions later in the program, but their meaning is not important here). You want to convince your readers that your chosen weight functions do a fairly good job of spanning the range of sensible weight functions.

In other words, the weight functions should increase from 0 to 1 at different rates and at different positions, and this increase should happen in the range of sensible snowmelt values (snowmelt data are recorded in `sC-weight_justif.csv`).

You’ve chosen nine weight functions. They’re stored in the list `wfun` below. (If you’re interested, they are logistic functions with all combinations of three choices of the location parameter `x0` and three choices of the rate parameter `r`. This information does not help you justify your prediction model in this imaginary report.)

```
(sC <- read_csv("data/sC-weight_justif.csv"))
```

```
## Parsed with column specification:
## cols(
##   snowmelt = col_double()
## )
## # A tibble: 658 x 1
##    snowmelt
##       <dbl>
##  1    3.01
##  2   11.3
##  3   13.4
##  4   11.4
##  5    1.72
##  6    5.36
##  7    4.21
##  8    0.223
##  9    0.719
## 10    4.86
## # ... with 648 more rows
```

```
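## Define weight functions
# logistic() returns a logistic curve with location x0 and rate r.
logistic <- function(x0, r) function(x) 1 / (1 + exp(-r * (x - x0)))

# Build one function for each of the 3 x 3 combinations of x0 and r,
# then extract the list column of functions.
wfun <- crossing(
  x0 = c(0.0, 7.5, 15.0),
  r  = c(0.05, 0.20, 0.70)
) %>%
  mutate(f = map2(x0, r, logistic)) %>%
  pull(f)
```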
```
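Since `wfun` is a list of functions rather than a data frame, it usually needs to be evaluated on a grid of snowmelt values before it can be plotted. The following is a minimal sketch of that step in base R. The names `logistic_demo`, `wfun_demo`, `params`, `grid`, and `curves` are hypothetical, and the grid range of 0 to 30 is an assumption; in practice the range would come from the snowmelt values in `sC`.

```
# Same logistic family as above; force() guards against lazy-evaluation
# surprises when the closures are created inside a loop.
logistic_demo <- function(x0, r) {
  force(x0); force(r)
  function(x) 1 / (1 + exp(-r * (x - x0)))
}

# All 3 x 3 parameter combinations (r varies fastest, x0 slowest).
params <- expand.grid(r = c(0.05, 0.20, 0.70), x0 = c(0.0, 7.5, 15.0))
wfun_demo <- Map(logistic_demo, params$x0, params$r)

# Assumed grid of snowmelt values on which to evaluate each function.
grid <- seq(0, 30, length.out = 101)

# One row per (function, grid point), ready for plotting one curve per function.
curves <- do.call(rbind, lapply(seq_along(wfun_demo), function(i) {
  data.frame(
    snowmelt = grid,
    weight   = wfun_demo[[i]](grid),
    x0       = params$x0[i],
    r        = params$r[i]
  )
}))

head(curves)
```

A long data frame like `curves` can be mapped to one line per `(x0, r)` pair, and the observed snowmelt values can be overlaid to show where the curves rise relative to the data.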

You have daily snowmelt data (in millimeters) at a particular location near Canmore, Alberta. You notice that the distribution of snowmelt on a given day changes across the year, so you’ve estimated the distribution for every day of the year.

Someone’s given you many years’ worth of new data. Make a graph to show whether the new data “agree” with your distribution estimates, that is, whether or not the data could plausibly have been drawn from the distributions you estimated. It would also be useful to show that your distribution estimates match up with the original data.

About the data:

- `sD-dist_melt.csv` has the snowmelt data, new and original.
- `sD-dist_melt.Rdata` (too large to gift: find in students repo, or click here) loads a list with variable name `dist_melt` containing the distributions. Density estimates are not available, but the cdf and quantile functions are, and are stored as list elements 1 and 2, respectively.
  - cdf arguments:
    - `x`: vectorized. The variable of the distribution.
    - `d`: day of the year; integer acceptable from 1 to 366.
  - quantile function arguments:
    - `p`: vectorized. Probabilities to find quantiles of.
    - `d`: day of the year; integer acceptable from 1 to 366.

```
(sF <- read_csv("data/sD-dist_melt.csv"))
```

```
## Parsed with column specification:
## cols(
##   data = col_character(),
##   date = col_date(format = ""),
##   snowmelt = col_double()
## )
## # A tibble: 3,185 x 3
##    data     date       snowmelt
##    <chr>    <date>        <dbl>
##  1 original 1980-04-19   NA
##  2 original 1982-04-20   NA
##  3 original 1983-04-20   NA
##  4 original 1984-04-19   -2.33
##  5 original 1986-04-20   -0.984
##  6 original 1987-04-20   -3.45
##  7 original 1988-04-19   -0.772
##  8 original 1990-04-20   12.3
##  9 original 1991-04-20    2.53
## 10 original 1993-04-20   -2.15
## # ... with 3,175 more rows
```

```
load("data/sD-dist_melt.Rdata")
str(dist_melt)
```

```
## List of 2
## $ cdf:function (x, d)
## ..- attr(*, "srcref")= 'srcref' int [1:8] 27 16 43 9 16 9 27 43
## .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7ffce8b852b8>
## $ qf :function (p, d)
## ..- attr(*, "srcref")= 'srcref' int [1:8] 52 11 60 5 11 5 52 60
## .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7ffce8b852b8>
```