These rubrics apply to your entire submission.
To get these mechanics marks, you are expected to:
To get the marks for this writing component, you should:
These marks apply to the collective visuals you make in this assignment.
These marks apply to the collective code you write in this assignment.
For this exercise, we’ll use the river discharge (aka “flow”, units m^3/s) data of the Bow River at Banff, Alberta. We’ve included it in the data folder of this lab; it is made available through the Water Survey of Canada. Your task is to investigate the river flow throughout the year – so, the response is river flow, and the (single) predictor is “day of the year” (also called “Julian day”). Hint: use
lubridate::yday() to get the day of year from a date, and
lubridate::year() to get the year.
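A minimal sketch of those two helpers (the example dates here are hypothetical; your data’s date column works the same way):

```r
library(lubridate)

# Two example dates (hypothetical; any Date vector works the same way)
dates <- as.Date(c("2020-01-01", "2020-06-08"))
yday(dates)  # day of year: 1, 160
year(dates)  # calendar year: 2020, 2020
```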
You’ll be using non-parametric regression (i.e., no assumptions on the model function) to generate a variety of predictions (not just the mean). Unless specified otherwise, we’ll leave it up to you to decide exactly which local method to use, but here are two options:
Option 1: manual local method
Obtain your estimate at a particular day of year d as follows:
If you want to make this more advanced (not required), you could:
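One possible realization of a manual local method, sketched on simulated data (the window half-width of 10 days and the simulated flows are assumptions for illustration, not part of the assignment):

```r
set.seed(1)
# Simulated stand-in for the river data (illustration only)
day  <- sample(1:365, 2000, replace = TRUE)
flow <- 40 + 100 * exp(-((day - 160)^2) / 2000) + rexp(2000, rate = 0.1)

# Local estimate at day d: keep observations within +/- h days of d,
# then summarize the flows in that window (mean, median, a quantile, ...)
local_estimate <- function(d, h = 10, summary_fun = mean) {
  summary_fun(flow[abs(day - d) <= h])
}

local_estimate(160)                        # local mean near day 160
local_estimate(160, summary_fun = median)  # local median near day 160
```

Swapping `summary_fun` for a sample quantile gives a local quantile estimate, which is the same idea you’ll need later in this lab.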
Option 2: from R packages
There might be functions available in some R packages that will do non-parametric regression for you.
- loess() for mean regression.
- quantreg::rqss() (fits a smoothing spline). To use it, when specifying the formula, you need to wrap each predictor in the qss() function, specifying the tuning parameter lambda also within qss().
- The ensembleBMA package might have something suitable; I recommend just using a manual method for this one.
Using a local regression method, estimate two model functions: one for the median, and another for the mean flow for each day of the year. Plot these two model functions overtop of the data.
What are the error measurements of each model function? Just use the training data. Does it make sense to compare these two error measurements to see which one fits better? Why or why not?
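To make the comparability question concrete, here is a hedged sketch on simulated data: the mean fit’s natural training error is mean squared error, while the median fit’s is mean absolute error. They are in different units (m^3/s squared vs m^3/s), which is one reason a direct comparison is questionable. The simulated flows and the +/- 10 day window are assumptions for illustration:

```r
set.seed(1)
day  <- sample(1:365, 500, replace = TRUE)
flow <- 40 + 100 * exp(-((day - 160)^2) / 2000) + rexp(500, rate = 0.1)

# Mean model function via loess; its natural training error is squared error
mean_fit <- loess(flow ~ day, span = 0.3)
mse <- mean((flow - fitted(mean_fit))^2)

# Crude local median (window of +/- 10 days); its natural error is absolute error
med_hat <- sapply(day, function(d) median(flow[abs(day - d) <= 10]))
mae <- mean(abs(flow - med_hat))

c(mse = mse, mae = mae)  # different losses, different units -- not directly comparable
```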
When doing Exercise 1.2 below, use cross-validation to:
Please indicate whether you’ve tried this or not.
Fit a model function that represents the 0.75-quantile of river flow given the day of the year. Use a pre-programmed R function such as quantreg::rqss() to fit a non-parametric 0.75-quantile model function, and plot it overtop of the data. Fit the model function by eye (so that it is neither overfit nor underfit). Note: outside of this lab assignment, if you want to plot quantile model functions, you can use method="rq" for linear quantile regression, or method="rqss" for non-parametric.
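As a hedged sketch of the plotting shortcut, ggplot2’s geom_quantile() accepts these method choices (the simulated data and the lambda value of 5 are assumptions; in the lab you would tune lambda by eye):

```r
library(ggplot2)  # geom_quantile() uses quantreg under the hood
set.seed(1)
df <- data.frame(day = sample(1:365, 500, replace = TRUE))
df$flow <- 40 + 100 * exp(-((df$day - 160)^2) / 2000) + rexp(500, rate = 0.1)

# method = "rqss" draws a non-parametric quantile curve; lambda is a
# by-eye smoothness choice, and quantiles = 0.75 targets the 0.75-quantile
p <- ggplot(df, aes(day, flow)) +
  geom_point(alpha = 0.3) +
  geom_quantile(method = "rqss", lambda = 5, quantiles = 0.75)
p
```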
What is the estimated 0.75-quantile flow on day 160? Interpret this quantity. Why might the 0.75-quantile be desirable as a prediction instead of the mean or median?
What is the error of this model? To save time, it’s fine to calculate this on the training data (although, of course, it would be more ideal to get an estimate of generalization error).
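A natural error measure for a tau-quantile model is the check (or “pinball”) loss rather than squared error. A minimal sketch (the numbers in the example call are arbitrary):

```r
# Check (pinball) loss for a tau-quantile prediction:
# rho_tau(r) = r * (tau - I(r < 0)), with residual r = y - yhat
check_loss <- function(y, yhat, tau) {
  r <- y - yhat
  mean(r * (tau - (r < 0)))
}

check_loss(y = c(10, 20, 30), yhat = c(15, 15, 15), tau = 0.75)  # -> 5.416667
```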
Use a non-parametric regression method to provide a 90% prediction interval for flow on each day of the year. Do not make any distributional assumptions. Plot a 90% prediction band overtop of the data.
What is your prediction interval for flow on day 160? Provide an interpretation for the lower limit of the PI; then for the upper limit.
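One distribution-free route to a 90% band is to fit the 0.05- and 0.95-quantile curves; the gap between them is the band. A hedged sketch with quantreg::rqss() on simulated data (lambda = 5 is a by-eye assumption):

```r
library(quantreg)
set.seed(1)
day  <- sample(1:365, 500, replace = TRUE)
flow <- 40 + 100 * exp(-((day - 160)^2) / 2000) + rexp(500, rate = 0.1)

# Lower and upper curves of a 90% band: the 0.05- and 0.95-quantile fits
lo <- rqss(flow ~ qss(day, lambda = 5), tau = 0.05)
hi <- rqss(flow ~ qss(day, lambda = 5), tau = 0.95)

# Band at day 160 (lower limit, upper limit)
c(predict(lo, newdata = data.frame(day = 160)),
  predict(hi, newdata = data.frame(day = 160)))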
Use a local regression method to provide two probabilistic forecasts: one on day 160 (“peak flood season”), and another on day 300 (“low flood season”). Report your answer as a plot of the density estimates of both distributions.
Now, imagine you’re interested in setting up a campground along the Bow River at this gauge location (the town of Banff is currently located there, but imagine it isn’t). You’ve also learned that flows above 200 m^3/s typically result in a flood, with severity growing with larger flows. If you were to set up camp there, what does your predictive distribution tell you about the flooding situation you’d have to deal with year-in and year-out? Would you say setting up camp is a risky move?
Compute a probabilistic forecast for flow on day 160 using data outside of day 160. Use loess with a Gaussian or tricubic kernel weight function and local polynomials of degree 1 or 2. Feel free to estimate either the cdf, quantile function, or density, but either way, plot your result AND your distribution from 1.4.
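One way to get a conditional cdf out of loess: for each threshold t in a grid, regress the indicator 1(flow <= t) on day; the fitted value at day 160 is a local mean of the indicator, i.e. an estimate of P(flow <= t | day = 160). A hedged sketch on simulated data (the resulting curve is rough and not guaranteed monotone; the span is a by-eye choice):

```r
set.seed(1)
day  <- sample(1:365, 1000, replace = TRUE)
flow <- 40 + 100 * exp(-((day - 160)^2) / 2000) + rexp(1000, rate = 0.1)

# Estimate the conditional cdf F(t | day = 160): a loess fit of the
# indicator 1(flow <= t) on day is a local estimate of P(flow <= t | day)
t_grid <- seq(min(flow), max(flow), length.out = 50)
cdf_160 <- sapply(t_grid, function(t) {
  fit <- loess(as.numeric(flow <= t) ~ day, degree = 1, span = 0.3)
  predict(fit, newdata = data.frame(day = 160))
})

plot(t_grid, cdf_160, type = "l",
     xlab = "flow", ylab = "estimated cdf of flow at day 160")
```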
You might have noticed that the distribution of flow across day of the year is right-skewed. It turns out that the distribution of log(flow) across the year is far more symmetric – in fact, the distributions seem to resemble Gaussian distributions.
For this exercise, you’ll estimate mean and variance in 2.1, and then in 2.2, you’ll assume that log(flow) is Gaussian on any given day of the year, and will extract some probabilistic quantities using the assumption.
Note that you are still not making any assumptions on any model function!
First, obtain mean estimates of log(flow) for each day of the year using
loess() (again, fit this by eye – you’ll need a plot for this, which you’ll report at the end of this exercise (2.1)). Then, obtain variance estimates by fitting
loess() again, this time with the squared residuals as the response (again, fit this by eye). Remember, we can do this because the variance is also a mean (of the squared residuals)!
To show that you’ve estimated the mean and variance, display these on a plot of the data (remember, log scale) with a model function representing the mean, and 90% prediction bands based on the Gaussian distribution.
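The two-step recipe above can be sketched as follows (simulated log-flows and span = 0.3 are assumptions for illustration; in the lab you would fit the spans by eye on the real data):

```r
set.seed(1)
day   <- sample(1:365, 1000, replace = TRUE)
lflow <- log(40 + 100 * exp(-((day - 160)^2) / 2000)) + rnorm(1000, sd = 0.3)

# Step 1: mean of log(flow) by loess
mean_fit <- loess(lflow ~ day, span = 0.3)

# Step 2: variance = mean of the squared residuals, also by loess
res2    <- (lflow - fitted(mean_fit))^2
var_fit <- loess(res2 ~ day, span = 0.3)

grid  <- data.frame(day = 1:365)
mu    <- predict(mean_fit, grid)
sigma <- sqrt(pmax(predict(var_fit, grid), 0))  # guard against small negatives

# Gaussian 90% prediction band on the log scale
lower <- mu + qnorm(0.05) * sigma
upper <- mu + qnorm(0.95) * sigma

plot(day, lflow, col = "grey", xlab = "day of year", ylab = "log(flow)")
lines(grid$day, mu)
lines(grid$day, lower, lty = 2)
lines(grid$day, upper, lty = 2)
```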
Now, assume that log(flow) on any given day of the year follows a Gaussian distribution. Equivalently, this means that flow follows a log normal distribution, denoted in R as
*lnorm() (such as
dlnorm()). Provide estimates of the following quantities of flow (not log(flow)!). Note that you should not have to fit any more models to answer this question.
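A minimal sketch of extracting flow quantities from the lognormal assumption (the mu and sigma values below are placeholders; in the lab they come from your loess fits in 2.1):

```r
# Hypothetical log-scale mean and sd for one particular day (placeholders)
mu <- 4.5; sigma <- 0.4

qlnorm(0.5, meanlog = mu, sdlog = sigma)      # median flow: equals exp(mu)
exp(mu + sigma^2 / 2)                         # mean of a lognormal
1 - plnorm(200, meanlog = mu, sdlog = sigma)  # P(flow > 200): flood probability
```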
In this exercise, we will work with the crime dataset found in the book “Statistical Methods for Social Sciences”, Third Edition, by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). It has been made available online as a
.dta file at https://stats.idre.ucla.edu/stat/data/crime.dta. Let’s read it in:
library(foreign)  # read.dta() comes from the foreign package
crime <- read.dta("https://stats.idre.ucla.edu/stat/data/crime.dta")
We’ll investigate how the number of murders (per million, column
murder) influences the number of violent crimes (per million, column
crime). More documentation on the data can be found on this UCLA webpage.
Plot the data. Visually, are there any noticeable outliers in the predictor space, and/or outliers in the response? What makes this an outlier / these outliers?
Plot the Cook’s distances for each observation that results from fitting a linear regression with
lm(). You can do this with the
plot() function, specifying
which=4. Do the Cook’s distances agree with your speculations in 2.1?
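The mechanics of the Cook’s distance plot, sketched on simulated data with one planted outlier (the crime data itself is not used here, so the numbers are assumptions for illustration):

```r
# Simulated data with one extreme point (illustration only)
set.seed(1)
x <- c(rnorm(50), 20)               # point 51 is extreme in predictor space
y <- c(2 * x[1:50] + rnorm(50), 5)  # ...and far from the trend in the response

fit <- lm(y ~ x)
plot(fit, which = 4)                # Cook's distance for each observation
which.max(cooks.distance(fit))      # -> 51: the planted outlier dominates
```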
Fit a robust linear regression using the
lmRob() function from the
robust package (which sits on top of the
robustbase package, which has a similar function called
lmrob()). Plot this on top of the data, along with the model functions for the standard linear regression, and linear median regression (you can use
ggplot2’s built-in functions for these, if you’d like).
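A hedged sketch of overlaying the three fits, again on simulated data with a planted outlier (assumes the robust and quantreg packages are installed; the colours and data are illustrative choices):

```r
library(robust)    # lmRob(); assumes the package is installed
library(quantreg)  # rq() for linear median regression
library(ggplot2)

set.seed(1)
df <- data.frame(x = c(rnorm(50), 20))
df$y <- c(2 * df$x[1:50] + rnorm(50), 5)   # one influential outlier

ols <- lm(y ~ x, data = df)                # standard linear regression
rob <- lmRob(y ~ x, data = df)             # robust fit, resistant to the outlier
med <- rq(y ~ x, data = df, tau = 0.5)     # linear median regression

ggplot(df, aes(x, y)) + geom_point() +
  geom_abline(intercept = coef(ols)[1], slope = coef(ols)[2], colour = "red") +
  geom_abline(intercept = coef(rob)[1], slope = coef(rob)[2], colour = "blue") +
  geom_abline(intercept = coef(med)[1], slope = coef(med)[2], colour = "darkgreen")
```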