suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(Lahman))
suppressPackageStartupMessages(library(quantreg))

It’s common to “default” to using the mean to make decisions. But the mean is not always appropriate (I wrote a blog post about this).

When the mean isn’t the right summary, we care about quantiles instead. Estimating them is called quantile regression (as opposed to mean regression).

Recall what quantiles are: the \(\tau\)-quantile (for \(\tau\) between 0 and 1) is the number that the outcome exceeds with probability \(1-\tau\). In other words, there is a probability of \(\tau\) that the outcome will be below the \(\tau\)-quantile.

\(\tau\) is referred to as the quantile level, or sometimes the quantile index.

For example, a bus company might want to predict the 0.8-quantile of transit time – 80% of buses will get to their destination within that time.
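
For instance, given a sample of observed transit times, the empirical 0.8-quantile is one line of R. (The simulated times below are purely illustrative, not real data.)

set.seed(123)
transit <- rexp(1000, rate = 1/30)  # hypothetical transit times, averaging 30 minutes
quantile(transit, 0.8)              # 80% of trips finish within this many minutes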

What is the mean, anyway?

Imagine trying to predict your total expenses for the next two years. You have monthly expenses listed for the past 12 months. What’s one simple way of making your prediction? Calculate the average expense from the past 12 months, and multiply that by 24.
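
In R, that calculation is a one-liner. (The monthly figures below are made up for illustration.)

monthly <- c(1200, 950, 1100, 1300, 1000, 1250,
             980, 1150, 1400, 1050, 1200, 1100)  # hypothetical monthly expenses
mean(monthly) * 24                               # predicted total over the next 24 months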

In general, a mean (or expected value) can be interpreted as the long-run average. However, the mean tends to be interpreted as a measure of central tendency, which has a more nebulous interpretation as a “typical” outcome, or an outcome for which most of the data will be “nearest”.

Linear Quantile Regression

The idea here is to model \[Q(\tau)=\beta_0(\tau) + \beta_1(\tau) X_1 + \cdots + \beta_p(\tau) X_p,\] where \(Q(\tau)\) is the \(\tau\)-quantile. In other words, each quantile level gets its own line, and each line is fit independently of the others.

Here are the 0.25-, 0.5-, and 0.75-quantile regression lines for the baseball data:

dat <- Teams %>% as_tibble() %>% 
  select(runs=R, hits=H)
ggplot(dat, aes(hits, runs)) +
    geom_point(alpha=0.1, colour="orange") +
    geom_quantile(colour="black", quantiles=c(0.25, 0.5, 0.75)) +
    theme_bw() +
    labs(x="Number of Hits (X)",
         y="Number of Runs (Y)")
## Smoothing formula not specified. Using: y ~ x

I did this easily with ggplot2, just by adding a geom_quantile layer to my scatterplot, specifying the quantile levels with the quantiles= argument. We could also use the rq function from the quantreg package in R:

(fit_rq <- rq(runs ~ hits, data=dat, tau=c(0.25, 0.5, 0.75)))
## Call:
## rq(formula = runs ~ hits, tau = c(0.25, 0.5, 0.75), data = dat)
## 
## Coefficients:
##                tau= 0.25 tau= 0.50  tau= 0.75
## (Intercept) -118.8297872 8.2101818 64.0347349
## hits           0.5531915 0.4923636  0.4908592
## 
## Degrees of freedom: 2835 total; 2833 residual

If we were to again focus on the two teams (one with 1000 hits, and one with 1500 hits), we have (by evaluating the above three lines):

predict(fit_rq, newdata=data.frame(hits=c(1000, 1500)))
##   tau= 0.25 tau= 0.50 tau= 0.75
## 1  434.3617  500.5738  554.8940
## 2  710.9574  746.7556  800.3236

So, we could say that the team with 1000 hits:

- has a 25% chance of scoring fewer than about 434 runs,
- has a 50% chance of scoring fewer than about 501 runs, and
- has a 75% chance of scoring fewer than about 555 runs,

amongst other things.

Exercise

Problem: Crossing quantiles

Because each quantile is allowed to have its own line, some of these lines might cross, giving an invalid result. Here is an example with the iris data set, fitting the 0.2- and 0.3-quantiles:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    geom_point(alpha=0.25, colour="orange") +
    geom_quantile(aes(colour="0.2"), quantiles=0.2) +
    geom_quantile(aes(colour="0.3"), quantiles=0.3) +
    scale_colour_discrete("Quantile\nLevel") +
    theme_bw() +
    labs(x="Sepal Length",
         y="Sepal Width")
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x

# Fit the 0.2- and 0.3-quantile regressions
fit_iris <- rq(Sepal.Width ~ Sepal.Length, data=iris, tau=2:3/10)
# Coefficient matrix: one column per quantile level
b <- coef(fit_iris)
# Predictions at Sepal.Length = 8, beyond where the lines cross
at8 <- round(predict(fit_iris, newdata=data.frame(Sepal.Length=8)), 2)
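
Using the coefficient matrix b, we can solve for the Sepal Length at which the two lines intersect (setting \(\beta_0(0.2) + \beta_1(0.2)x = \beta_0(0.3) + \beta_1(0.3)x\)); this is where the 7.3 below comes from:

# Sepal Length at which the 0.2- and 0.3-quantile lines cross (about 7.3)
(b[1, 2] - b[1, 1]) / (b[2, 1] - b[2, 2])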

Quantile estimates of Sepal Width are valid for plants with Sepal Length less than about 7.3, but not beyond that. For example, for plants with a Sepal Length of 8, this model predicts that 30% of such plants have a Sepal Width of less than 2.75, yet that only 20% have a Sepal Width of less than 2.82. This is an illogical statement.

There have been several “adjustments” proposed to ensure that this doesn’t happen, but ultimately, crossing quantiles suggest an inadequacy in the model assumptions. Luckily, crossing usually only happens at extreme values of the predictor space and/or at extreme quantile levels, so it is usually not a problem in practice.

Problem: Extreme quantiles

Quantile estimates usually become worse for extreme values of \(\tau\) (that is, \(\tau\) near 0 or 1). This is especially true when data are heavy-tailed.

Here is a histogram of 100 observations generated from a Student’s t(1) distribution (it’s heavy-tailed):

set.seed(4)
y <- rt(100, df=1)
qplot(y) + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here are estimates of high and low quantiles, compared to the true values. You can see the discrepancy grow quickly: extreme-low quantiles are estimated too high, whereas extreme-high quantiles are estimated too low.
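
One way to compute such a comparison, using the simulated y from above and the true quantile function qt() of the t(1) distribution:

tau <- c(0.01, 0.05, 0.25, 0.75, 0.95, 0.99)
data.frame(
    tau       = tau,
    estimated = quantile(y, probs=tau),  # empirical quantile estimates
    true      = qt(tau, df=1)            # true quantiles of the t(1) distribution
)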

As a rule of thumb, it’s best to keep \(\tau\) between 0.05 and 0.95. If you really want estimates of more extreme quantiles, you’ll need to turn to Extreme Value Theory and make an assumption about the tail of the distribution of the data. One common approach is to fit a generalized Pareto distribution to the upper portion of the data, after which you can extract high quantiles.
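
Here is a minimal sketch of that approach, fitting the generalized Pareto distribution by maximum likelihood with optim(). The threshold choice and starting values are assumptions made for illustration; a real analysis would use a dedicated package such as evd or ismev.

# Exceedances over a high threshold (here, the empirical 0.90-quantile)
u <- unname(quantile(y, 0.90))
z <- y[y > u] - u
# Negative log-likelihood of the generalized Pareto distribution, par = (sigma, xi)
ngpd <- function(par) {
    sigma <- par[1]; xi <- par[2]
    if (sigma <= 0 || any(1 + xi * z / sigma <= 0)) return(Inf)
    length(z) * log(sigma) + (1/xi + 1) * sum(log(1 + xi * z / sigma))
}
fit_gpd <- optim(c(1, 0.5), ngpd)
sigma <- fit_gpd$par[1]; xi <- fit_gpd$par[2]
# Extrapolated estimate of the 0.99-quantile of y
p_u <- mean(y > u)  # chance of exceeding the threshold
u + sigma / xi * (((1 - 0.99) / p_u)^(-xi) - 1)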

Evaluating Model Goodness

The question here is: if we have two or more models that predict the \(\tau\)-quantile, which model is best? We’ll need some way to score the different models in order to compare them.

**NOTE**: Mean Squared Error is not appropriate here! This is very important to remember.

The reason is technical: the MSE is not a proper scoring rule for quantiles. In other words, the MSE does not elicit an honest quantile prediction – a forecaster minimizing the expected squared error would report the mean, not the \(\tau\)-quantile.

If we’re predicting the median, then the mean absolute error works. This is like the MSE, but instead of squaring the errors, we take their absolute values.

In general, a “correct” scoring rule for the \(\tau\)-quantile is as follows: \[ S = \sum_{i=1}^{n} \rho_{\tau}(Y_i - \hat{Q}_i(\tau)), \] where \(Y_i\) for \(i=1,\ldots,n\) is the response data, \(\hat{Q}_i(\tau)\) are the \(\tau\)-quantile estimates, and \(\rho_{\tau}\) is the check function (also known as the absolute asymmetric deviation function or tick function), given by \[ \rho_{\tau}(s) = (\tau - I(s<0))s \] for real \(s\). This scoring rule is negatively oriented, meaning the lower the score, the better. It cannot be below 0.
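
As a concrete illustration, here is a minimal R implementation of this score. (The names rho_tau and quantile_score are mine, not from a package.)

# Check function: rho_tau(s) = (tau - I(s < 0)) * s
rho_tau <- function(s, tau) (tau - (s < 0)) * s
# Negatively oriented score for tau-quantile predictions: lower is better
quantile_score <- function(y, q_hat, tau) sum(rho_tau(y - q_hat, tau))
# Example: score the linear 0.5-quantile model for the baseball data
q_hat <- predict(rq(runs ~ hits, data=dat, tau=0.5))
quantile_score(dat$runs, q_hat, tau=0.5)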

Here is a plot of various check functions. Notice that, when \(\tau=0.5\) (corresponding to the median), this is proportional to the absolute value:

base <- ggplot(data.frame(x=c(-2,2)), aes(x)) + 
    theme_bw() +
    labs(y=expression(rho)) +
    theme(axis.title.y=element_text(angle=0, vjust=0.5)) +
    ylim(c(0, 1.5))
rho <- function(tau) function(x) (tau - (x<0))*x  # check function, curried over tau
cowplot::plot_grid(
    base + stat_function(fun=rho(0.2)) + 
        ggtitle(expression(paste(tau, "=0.2"))),
    base + stat_function(fun=rho(0.5)) + 
        ggtitle(expression(paste(tau, "=0.5"))),
    base + stat_function(fun=rho(0.8)) + 
        ggtitle(expression(paste(tau, "=0.8"))),
    ncol=3
)
## Warning: Removed 4 rows containing missing values (geom_path).

## Warning: Removed 4 rows containing missing values (geom_path).

For quantile regression estimation, we minimize the sum of these scores, instead of the sum of squared residuals as in the usual (mean) linear regression.
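
To make that concrete, here is a crude sketch that fits the 0.5-quantile line by minimizing the check-function score numerically with optim(). (quantreg’s rq() solves the same problem with linear programming; this is for illustration only, and check_loss is my own helper.)

# Sum of check-function scores for a candidate line beta[1] + beta[2]*x
check_loss <- function(beta, y, x, tau) {
    s <- y - (beta[1] + beta[2] * x)
    sum((tau - (s < 0)) * s)
}
fit_hand <- optim(c(0, 0.5), check_loss, y=dat$runs, x=dat$hits, tau=0.5)
fit_hand$par  # compare with coef(rq(runs ~ hits, data=dat, tau=0.5))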