suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(Lahman))
Up until now, we’ve only seen different ways of using a predictor to give us more information about the mean and mode of the response. The world places a huge emphasis on the mean and mode, but these are not always what’s important. An alternative is the probabilistic forecast: the idea is to put forth an entire probability distribution as a prediction.
Let’s look at an example. Suppose there are two baseball teams, one that gets 1000 total hits in a year, and another that gets 1500. Using “total hits in a year” as a predictor, we set out to predict the total number of runs of both teams. Here’s the top snippet of the data:
dat <- Teams %>%
    as_tibble() %>%
    select(runs = R, hits = H)
Let’s not concern ourselves with the methods yet. Using a standard regression technique, here are our predictions:
r <- 20
datsub <- filter(dat,
(hits>=1000-r & hits<=1000+r) |
(hits>=1500-r & hits<=1500+r)) %>%
mutate(approx_hits = if_else(hits>=1000-r & hits<=1000+r, 1000, 1500))
datsub %>%
group_by(approx_hits) %>%
summarize(expected_runs=round(mean(runs))) %>%
select(hits=approx_hits, expected_runs)
## # A tibble: 2 x 2
## hits expected_runs
## <dbl> <dbl>
## 1 1000 558
## 2 1500 768
Using a probabilistic forecast, here are our predictions:
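The figure itself isn’t reproduced here, but here’s a sketch of how the two forecast distributions could be plotted from the datsub subset defined above (the binwidth is an arbitrary choice):

```r
library(tidyverse)
# One empirical distribution of runs per window (1000 and 1500 hits)
ggplot(datsub, aes(runs)) +
    facet_wrap(~ approx_hits) +
    geom_histogram(aes(y = after_stat(density)), binwidth = 50,
                   fill = "orange", colour = "black") +
    theme_bw() +
    labs(x = "Number of Runs (Y)", y = "Density")
```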
Don’t you think this is far more informative than the mean estimates in the above table?
The probabilistic forecast/prediction contains as much information about the response as possible (based on a set of predictors), because it communicates the entire belief of which \(Y\) values are most plausible, given values of the predictor.
Predictions/forecasts here are called predictive distributions.
From @gneiting_raftery:
Indeed, over the past two decades, probabilistic forecasting has become routine in such applications as weather and climate prediction (Palmer 2002; Gneiting and Raftery 2005), computational finance (Duffie and Pan 1997), and macroeconomic forecasting (Garratt, Lee, Pesaran, and Shin 2003; Granger 2006).
Let’s review how to estimate a univariate probability density function or probability mass function.
Here’s a random sample of 10 continuous variables, ordered from smallest to largest, stored in the variable x:
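The sample itself isn’t displayed here. For concreteness, here’s one way such a sample could be generated (the Normal distribution and its parameters are assumptions, chosen to match the larger sample used below):

```r
set.seed(1)                    # assumed seed, for reproducibility
x <- sort(rnorm(10, sd = 10))  # 10 draws, ordered smallest to largest
x
```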
Recall that we can use histograms to estimate the density of the data. The idea is to count the number of observations falling into each of several equal-width bins:
ggplot(data.frame(x=x), aes(x)) +
geom_histogram(binwidth=10, center=min(x)+5,
fill="orange", colour="black") +
theme_bw()
(Note: this is not a true density, since the area under the bars is not 1, but the shape is what matters.)
You’d have to play with the binwidth to get a histogram that looks about right (not too jagged, not too coarse). For the above example, there are too few data to make a good estimate. Let’s now generate 1000 observations, and make a histogram using qplot from R’s ggplot2 package, with a variety of binwidths – too small, too large, and just right.
set.seed(1)  # for reproducibility
x <- rnorm(1000, sd=10)
qplot(x, binwidth=1) # too small
qplot(x, binwidth=10) # too big
qplot(x, binwidth=3.5) # just right
Advanced method: there’s a technique called the kernel density estimate that works as an alternative to the histogram. The idea is to put a “little mound” (a kernel) on top of each observation, and add them up. Instead of playing with the binwidth, you can play with the “bandwidth” of the kernels. Use geom="density" in qplot, and use bw to play with the bandwidth:
qplot(x, geom="density", bw=2.5)
When the response is discrete (this includes categorical), the approach is simpler:
Here are ten observations, stored in x:

x <- c(1, 0, 0, 0, 2, 0, 1, 2, 3, 0)
x
## [1] 1 0 0 0 2 0 1 2 3 0
The proportions are as follows:
props <- tibble(Value=x) %>%
group_by(Value) %>%
summarize(Proportion=length(Value)/length(x))
You can plot these proportions with qplot, specifying geom="col":
qplot(x=Value, y=Proportion, data=props, geom="col")
You can use ggplot2 to calculate the proportions, but it’s more complex. It’s easier to plot the raw counts instead of proportions – and that’s fine, you’ll still get the same shape. Using qplot again, let’s make a plot for 1000 observations (note that I indicate that my data are discrete by using the factor function):
set.seed(2)
x <- rpois(1000, lambda=1)
qplot(factor(x))
Here’s the code to get proportions instead of counts:
ggplot(tibble(x), aes(factor(x))) +
    geom_bar(aes(y = after_stat(prop), group = 1)) +
    labs(x = "Value", y = "Proportion")
The local methods and classification/regression trees that we’ve seen so far can be used to produce probabilistic forecasts. For local methods, let’s ignore the complications of kernel weighting and local polynomials. These methods result in a subset of the data, for which we’re used to taking the mean or mode. Instead, use the subsetted data to plot a distribution.
The above baseball example used a moving window with a radius of 20 hits. Visually, you can see the data that I subsetted within these two narrow windows, for hits of 1000 and 1500:
ggplot(dat, aes(hits, runs)) +
geom_point(colour="orange", alpha=0.1) +
geom_vline(xintercept=c(1000+c(-r,r), 1500+c(-r,r)),
linetype="dashed") +
theme_bw() +
labs(x="Number of Hits (X)",
y="Number of Runs (Y)")
Exercise: using the Lahman package (which contains the Teams dataset), produce a probabilistic forecast of the number of runs (R column) for a team with 1500 hits (H column) and 70 wins (W column). Don’t forget to scale the predictors!

Let’s examine the bias-variance / overfitting-underfitting tradeoff with kNN-based probabilistic forecasts. I’ll run a simulation like so:
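The simulation code isn’t shown in these notes; here’s a minimal sketch of the idea, where the data-generating process, the values of \(k\), and the prediction point are all assumptions:

```r
library(tidyverse)
set.seed(1)
n  <- 1000  # training sample size (assumed)
x0 <- 0     # point at which we forecast (assumed)
sim <- crossing(rep = 1:20, k = c(10, 100, 1000)) %>%
    pmap_dfr(function(rep, k) {
        x <- runif(n, -3, 3)
        y <- sin(x) + rnorm(n, sd = 0.5)
        nearest <- order(abs(x - x0))[1:k]  # indices of the k nearest neighbours
        tibble(rep = rep, k = k, y = y[nearest])
    })
# Each (rep, k) group of y values is one estimated predictive distribution
ggplot(sim, aes(y, group = rep)) +
    facet_wrap(~ k) +
    geom_density(colour = "orange") +
    geom_vline(aes(xintercept = m), linetype = "dashed",
               data = sim %>% group_by(k) %>% summarize(m = mean(y))) +
    theme_bw()
```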
Here are the 20 estimates for each value of \(k\). The overall mean of the distributions is indicated by a vertical dashed line.
Notice that small values of \(k\) produce noisy, high-variance estimates of the distribution, whereas large values of \(k\) produce stable but biased estimates.
A similar thing happens with a moving window, with the window width parameter. For tree-based methods, the amount that you partition the predictor space controls the bias-variance tradeoff.
To choose a balance between bias and variance, we need a measure of prediction goodness. When predicting the mean, the MSE works. When predicting the mode, the classification error works. But what works for probabilistic forecasts?
This is an active area of research. The idea is to use a proper scoring rule – a way of assigning a score based on the forecast distribution and the outcome only, that also encourages honesty. We won’t go into details – see [@gneiting_raftery] for details.
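As an illustration (made up, not from the source): the log score is one common proper scoring rule. For a discrete response with a sample-based forecast, it’s the negative log of the probability the forecast assigned to the realized outcome:

```r
forecast_sample <- c(0, 0, 0, 1, 1, 2, 0, 1, 3, 0)  # hypothetical kNN subset
outcome <- 1                                        # hypothetical realized value
p_hat <- mean(forecast_sample == outcome)           # forecast probability of the outcome
-log(p_hat)                                         # log score: lower is better
```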
At the very least, one should check that the forecast distributions are “calibrated” – that is, the actual outcomes are spread evenly amongst the forecasts. You can check this by applying each forecast cdf to the corresponding outcome – the resulting sample should be Uniform(0,1). Note that this check is built into at least some proper scoring rules.
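Here’s a sketch of that check with simulated forecasts and outcomes (everything here is made up): each outcome is drawn from the same distribution as its forecast, so the resulting sample should look flat.

```r
set.seed(1)
pit <- replicate(1000, {
    forecast_sample <- rnorm(100)    # one forecast distribution (assumed Normal)
    outcome <- rnorm(1)              # outcome from the true distribution
    ecdf(forecast_sample)(outcome)   # forecast cdf applied to the outcome
})
hist(pit)  # roughly Uniform(0,1) when calibrated
```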
For this course, we won’t be picky about how you choose your tuning parameters. Just look for a subset that you think has “enough” observations in it so that the distribution starts to take some shape, but not so much that it starts to shift.
For (1) and (2) below, you’re choosing between two candidates to hire. Discuss the pros and cons of choosing one candidate over the other in the following situations.
It’s hard to make a decision here. On the one hand, we can be fairly certain that the actual productivity of candidate A will be about 75, but there’s more of a gamble with candidate B. There’s a very real chance that B’s productivity is actually quite a bit higher than A – for example, a productivity of 80 is plausible for B, but not for A. On the other hand, there’s also a very real chance that B’s productivity is actually quite a bit lower than A, for the same reason. Your decision would depend on whether you would want to take a risk or not.
That said, in reality, this is only one tool out of many other aspects of the candidate that you would consider. It might be a good idea to chat with B to get a better sense of what their productivity might actually be.
In this case, B is very very likely to have higher productivity than A, because all “plausible” productivity values for B are higher than all “plausible” productivity values of A.
Again, this is just one tool you might use to make a decision.
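One way to make such a comparison concrete – with made-up forecast samples for the two candidates – is to estimate the probability that B out-produces A by Monte Carlo:

```r
set.seed(1)
a <- rnorm(10000, mean = 75, sd = 1)  # hypothetical forecast for candidate A
b <- rnorm(10000, mean = 78, sd = 5)  # hypothetical forecast for candidate B
mean(b > a)                           # estimated P(B is more productive than A)
```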
The forecast is biased, because the actual values are occurring near the upper tail of the distribution – they should be scattered about the middle, with a higher density of points occurring near 0. If using local methods, we’d have to reduce \(k\) or the window width to decrease bias (to remove “further” data that are less relevant); if using a tree-based method, you could grow the tree deeper to lower the bias.
Probabilistic forecasts are useful if you’re making a small number of decisions at a time – for example, choosing between two candidates, as above. But they are not appropriate when making decisions en masse, where each case demands a single actionable number rather than a whole distribution.