Local Regression
By the end of this lecture, you should be able to:
Set aside regression techniques for the global conditional mean and explore local alternatives suitable for prediction!
\[ \mathbb{E}(Y_i \mid X_{i,1} = x_{i,1}, \ldots, X_{i,k} = x_{i,k}) = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_k x_{i,k} \; \; \; \; \text{since} \; \; \; \; \mathbb{E}(\varepsilon_i) = 0. \]
\[\begin{equation*} C_0(X_i) = I(X_i < c_1) = \begin{cases} 1 \; \; \; \; \mbox{if $X_i < c_1$},\\ 0 \; \; \; \; \mbox{otherwise.} \end{cases} \end{equation*}\]
\[\begin{equation*} C_1(X_i) = I(c_1 \leq X_i < c_2) = \begin{cases} 1 \; \; \; \; \mbox{if $c_1 \leq X_i < c_2$},\\ 0 \; \; \; \; \mbox{otherwise.} \end{cases} \end{equation*}\]
\[\vdots\]
\[\begin{equation*} C_{q -1 }(X_i) = I(c_{q - 1} \leq X_i < c_q) = \begin{cases} 1 \; \; \; \; \mbox{if $c_{q-1} \leq X_i < c_q$},\\ 0 \; \; \; \; \mbox{otherwise.} \end{cases} \end{equation*}\]
\[\begin{equation*} C_q(X_i) = I(c_q \leq X_i) = \begin{cases} 1 \; \; \; \; \mbox{if $c_q \leq X_i$},\\ 0 \; \; \; \; \mbox{otherwise.} \end{cases} \end{equation*}\]
Heads-up: The step function \(C_0(X_i)\) does not appear in the model given that \(\beta_0\) can be interpreted as the response’s mean when \(X_i < c_1\) (i.e., the rest of the \(q\) step functions are equal to zero).
The fat_content dataset records the fat yields from a cow in kg/day, from week 1 to week 35.
We fit model_steps with formula = fat ~ steps via lm(). Heads-up: Note that each estimate is the difference between the corresponding category and the baseline [0.966, 7.8).
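As a minimal sketch of this step, assuming fat_content has columns week and fat, and that steps comes from binning week into five equal-width intervals (which the baseline label [0.966, 7.8) suggests, though the lecture's exact breaks may differ):

```r
# Assumption: five equal-width bins over `week`, hence four interior knots.
fat_content$steps <- cut(fat_content$week, breaks = 5, right = FALSE)

# Piecewise-constant model: one mean per bin, baseline = first bin.
model_steps <- lm(fat ~ steps, data = fat_content)
summary(model_steps)
```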
Let us see what happens in this model:
If \(X_i < c_1\), then our model is: \(Y_i = \beta_0 + \beta_{q + 1}X_i\)
If \(c_1 \leq X_i < c_2\), then our model is: \(Y_i = \beta_0 + \color{red}{\beta_1} + (\beta_{q + 1} + \color{green}{\beta_{q + 2}}) X_i\)
If \(c_2 \leq X_i < c_3\), then our model is: \(Y_i = \beta_0 + \color{red}{\beta_2} + (\beta_{q + 1} + \color{green}{\beta_{q + 3}}) X_i\)
\[\vdots\]
So, for each interval, we have an \(\color{red}{\text{additional intercept}}\) and an \(\color{green}{\text{additional slope}}\).
Now, we will check whether the model fit looks better on the fat_content data.
The formula, within lm(), is merely a model with the interaction week * steps. However, the local regression lines are disconnected!
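A minimal sketch of this fit (the plotting code assumes ggplot2 is available and is purely illustrative):

```r
# Separate intercept and slope per bin via the interaction week * steps.
model_interaction <- lm(fat ~ week * steps, data = fat_content)

# Plotting fitted values bin by bin makes the jumps at the knots visible.
library(ggplot2)
ggplot(fat_content, aes(x = week, y = fat)) +
  geom_point() +
  geom_line(aes(y = fitted(model_interaction), group = steps), colour = "red")
```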
Can we fix that?
Yes! In the lm() function, within formula on the right-hand side along with the standalone regressor X, we have to add the term I((X - c_j) * (X >= c_j)) for each knot c_j.
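A minimal sketch of this continuous piecewise-linear fit; the four knots below (7.8, 14.6, 21.4, 28.2) are illustrative values taken from the equal-width binning assumed earlier:

```r
# Each I((week - c_j) * (week >= c_j)) term changes the slope at knot c_j
# without breaking the line, so the fitted pieces connect at the knots.
model_continuous <- lm(
  fat ~ week +
    I((week - 7.8)  * (week >= 7.8))  +
    I((week - 14.6) * (week >= 14.6)) +
    I((week - 21.4) * (week >= 21.4)) +
    I((week - 28.2) * (week >= 28.2)),
  data = fat_content
)
summary(model_continuous)
```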
We apply this to fat_content (\(q = 4\) knots), as in the sketch above, and then use lm(). Heads-up: We have been using the letter \(k\) to denote the number of regressors in any regression model. Nevertheless, this letter is reserved for groups of \(k\) observations (the nearest neighbours) in this framework. Therefore, we will switch to the letter \(p\) to denote the number of regressors in this framework.
\[ D^{(\text{Training, New})} = \sqrt{\sum_{j = 1}^p \Big( x_j^{\text{(Training)}} - x_j^{\text{(New)}} \Big)^2} \]
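To make the idea concrete, here is a by-hand sketch of \(k\)-NN regression using the Euclidean distance above; the helper knn_predict and the usage on fat_content are illustrative, not functions from the lecture:

```r
# By-hand k-NN regression: average the responses of the k closest
# training points under Euclidean distance.
knn_predict <- function(x_train, y_train, x_new, k) {
  # Distance from each training row to the new point.
  d <- sqrt(rowSums(sweep(x_train, 2, x_new)^2))
  # Average the responses of the k nearest training points.
  mean(y_train[order(d)[1:k]])
}

# Hypothetical usage with the single regressor `week`:
# knn_predict(as.matrix(fat_content["week"]), fat_content$fat, x_new = 10, k = 5)
```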
Applying \(k\)-NN regression to the fat_content dataset raises a question: what are the consequences of overfitting a training set in further test sets?
A. There are no consequences at all; predictions will be highly accurate in any further test set.
B. The trained model will be so oversimplified that we will have a high bias in further test set predictions.
C. The trained model will be so overfitted that it will also explain random noise in training data. Therefore, we cannot generalize this model in further test set predictions.
Suppose we want to predict the fat yield at week 10. LOESS takes the points closest to week 10 (how many points is a parameter that we define via span in loess()). Then, using WLS for a second-degree polynomial for instance, we minimize the sum of squared errors considering the weight \(w_i\) as follows: \[\begin{equation*}
\sum_i w_i \left(y_i - \beta_0-\beta_1 x_i-\beta_2 x_i^2\right)^2
\end{equation*}\]
Heads-up: Roughly speaking, WLS is used when the error variance differs across observations. If \(\mathrm{Var}(\varepsilon_i \mid x_i) = \sigma_i^2\), then (up to a constant factor) the standard choice is \(w_i \propto 1/\sigma_i^2\). Using these weights can improve efficiency and helps model heteroscedasticity.
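In R, WLS is available through lm()'s weights argument; a minimal sketch, where the weight vector w is a placeholder assumption:

```r
# lm() with `weights` minimizes sum(w_i * e_i^2), the WLS criterion above.
# Placeholder weights; in practice w_i would be proportional to 1 / sigma_i^2.
w <- rep(1, nrow(fat_content))
model_wls <- lm(fat ~ week + I(week^2), data = fat_content, weights = w)
summary(model_wls)
```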
Now, there are a couple of things to consider in loess():
span (between 0 and 1) defines the “size” of your neighbourhood. To be more exact, it specifies the proportion of points considered as neighbours of \(x\). The higher the proportion, the smoother the fitted surface will be.
degree specifies whether you are fitting a constant (degree = 0), a linear model (degree = 1), or a quadratic model (degree = 2). By quadratic, we mean \(\beta_0 + \beta_1 x_i + \beta_2 x_i^2\).
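Putting it together, a minimal loess() sketch on fat_content; the span and degree values below are illustrative choices, not the lecture's tuned settings:

```r
# LOESS fit: span controls the neighbourhood size, degree the local polynomial.
model_loess <- loess(fat ~ week, data = fat_content, span = 0.5, degree = 2)

# Local prediction at week 10, as in the example above.
predict(model_loess, newdata = data.frame(week = 10))
```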