Lecture 3: Feature engineering#
UBC Master of Data Science program, 2022-23
Instructor: Varada Kolhatkar
Imports and LO#
Imports#
import os
import sys
sys.path.append("code/.")
import matplotlib.pyplot as plt
%matplotlib inline
import mglearn
import numpy as np
import numpy.random as npr
import pandas as pd
from plotting_functions import *
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
Learning outcomes#
From this lecture, students are expected to be able to:
Explain the importance of the quality of data.
Explain the importance of feature engineering when building machine learning models.
Explain the “column of ones” trick.
Explain the concept of polynomial feature transformations.
Apply linear models to non-linear datasets using polynomial features.
Explain the idea of feature crosses.
Identify when it is appropriate to apply discretization to numeric features.
Carry out preliminary feature engineering on text data using spaCy and nltk.
Feature engineering: Motivation#
❓❓ Questions for you#
iClicker Exercise 3.1#
iClicker cloud join link: https://join.iclicker.com/C0P55
Select the most accurate option below.
Suppose you are working on a machine learning project. If you had to prioritize one of the following in your project, which would it be?
(A) The quality and size of the data
(B) Most recent deep neural network model
(C) Most recent optimization algorithm
V’s answers: (A)
Garbage in, garbage out.#
Model building is interesting. But in your machine learning projects, you’ll be spending more than half of your time on data preparation, feature engineering, and transformations.
The quality of the data is important. Your model is only as good as your data.
Activity: How can you measure quality of the data? (~3 mins)#
Write some attributes of good- and bad-quality data in this Google Document.
What is feature engineering?#
Feature engineering is the process of determining which features might be useful for model building, and creating those features by transforming the given data or extracting them from alternative sources.
In 571 we talked about hyperparameter tuning, which is one way to get better model performance.
Another way is by changing the input representation.
A better representation gives more flexibility and higher scores, and lets us get by with simpler, more interpretable models.
If your features (i.e., your representation) are bad, whatever fancier model you build is not going to help.
Some quotes on feature engineering#
A quote by Pedro Domingos A Few Useful Things to Know About Machine Learning
... At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.
A quote by Andrew Ng, Machine Learning and AI via Brain simulations
Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering.
Better features usually help more than a better model.#
Good features would ideally:
capture most important aspects of the problem
allow learning with few examples
generalize to new scenarios.
There is a trade-off between simple and expressive features:
With simple features overfitting risk is low, but scores might be low.
With complicated features scores can be high, but so is overfitting risk.
The best features may be dependent on the model you use.#
Examples:
For counting-based methods like decision trees, we want to separate relevant groups of variable values.
Discretization makes sense
For distance-based methods like KNN, we want different class labels to be “far”.
Standardization
For regression-based methods like linear regression, we want targets to have a linear dependency on features.
Domain-specific transformations#
In some domains there are natural transformations to do:
Spectrograms (sound data)
Wavelets (image data)
Convolutions
In this lecture we’ll talk about the following:
Polynomial features (change of basis)
Feature engineering demos
numeric data
text data
Polynomial feature transformations for non-linear regression#
Linear regression prediction (recap)#
Interested in predicting a scalar valued target (e.g., housing price)
In DSCI 571, we talked about how to make predictions \(\hat{y_i}\), which is a linear function of the feature vector \(x_i\) and the weight vector \(w\): \(\hat{y_i} = w_0 + w_1x_{i1} + \dots + w_dx_{id}\)
\(\hat{y_i} \rightarrow\) prediction for example \(x_i\)
\(w \rightarrow\) weight vector
\(w_0 \rightarrow\) bias term
\(x_{ij} \rightarrow\) \(j^{th}\) component of the feature vector \(x_i\)
\(w_0, w_1, \ldots, w_d\) together are the parameters
Matrix vector notation#
The notation above is component-wise notation.
We also want to be able to write it as below so that we don’t have to write the summation all the time.
In matrix form, the expression for a linear model is \(\hat{y} = Xw\), where
\(\hat{y} \rightarrow\) prediction vector for feature matrix \(X\)
\(w \rightarrow\) weight vector
This notation is matrix vector notation. What happened to the bias term?
For simplicity, we rename the bias term as \(w_0\) and introduce a dummy feature whose value is always 1.
So \(w_0 + w_1x_{i1} + \dots + w_dx_{id}\) becomes \(w_0x_{i0} + w_1x_{i1} + \dots + w_dx_{id}\), where \(x_{i0}\) is always 1.
This is often referred to as “column of ones” trick.
An example of column of ones notation#
Suppose \(X\) has only one feature: \[X = \begin{bmatrix}0.86 \\ 0.02 \\ -0.42 \end{bmatrix}\]
Make a new matrix \(Z\) with an extra feature whose value is always 1: \[Z = \begin{bmatrix}1 & 0.86\\ 1 & 0.02 \\ 1 & -0.42\\ \end{bmatrix}\]
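A minimal NumPy sketch of this trick (using the toy matrix above):
# A minimal sketch of the "column of ones" trick
X_ones_demo = np.array([[0.86], [0.02], [-0.42]])
# Prepend a dummy feature that is always 1 so the bias can be folded into the weight vector
Z_ones_demo = np.hstack([np.ones((X_ones_demo.shape[0], 1)), X_ones_demo])
Z_ones_demo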
So far we assumed that we magically learned the parameters \(w\).
The most common way to learn these parameters in linear regression is by minimizing the quadratic cost between the actual target \(y\) and the model predictions \(\hat{y}\).
This is called ordinary least squares, the most commonly used loss function or cost function for linear regression.
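For reference, with the column-of-ones trick folded in, the least squares cost can be written as:
\[f(w) = \sum_{i=1}^{n} \left(w^Tx_i - y_i\right)^2\]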
More on this later.
Limitations of linear regression#
Linear models are fast and scalable.
But they might seem rather limited, especially in low-dimensional spaces because they only learn lines, planes, or hyperplanes.
What if the true relationship between the target and the features is non-linear?
Can we still use ordinary least squares to fit non-linear data?
One way to make linear models more flexible is using feature mappings or transformations!
Polynomial transformations#
Let’s consider this synthetic toy data with only one feature.
from matplotlib.pyplot import figure
np.random.seed(10)
n = 20
X = np.linspace(-3, 3, n)
y = X**2 + npr.randn(n)
X_toy = X[:, np.newaxis]
y_toy = y[:, np.newaxis]
pd.DataFrame(np.hstack([X_toy, y_toy]), columns=["feat1", "y"])
feat1 | y | |
---|---|---|
0 | -3.000000 | 10.331587 |
1 | -2.684211 | 7.920265 |
2 | -2.368421 | 4.064018 |
3 | -2.052632 | 4.204913 |
4 | -1.736842 | 3.637956 |
5 | -1.421053 | 1.299305 |
6 | -1.105263 | 1.487118 |
7 | -0.789474 | 0.731817 |
8 | -0.473684 | 0.228668 |
9 | -0.157895 | -0.149669 |
10 | 0.157895 | 0.457957 |
11 | 0.473684 | 1.427414 |
12 | 0.789474 | -0.341797 |
13 | 1.105263 | 2.249881 |
14 | 1.421053 | 2.248021 |
15 | 1.736842 | 3.461758 |
16 | 2.052632 | 3.076694 |
17 | 2.368421 | 5.744555 |
18 | 2.684211 | 8.689523 |
19 | 3.000000 | 7.920195 |
Let’s plot the data
figure(figsize=(6, 4), dpi=80)
plt.scatter(X_toy[:, 0], y_toy, s=50, edgecolors=(0, 0, 0))
plt.xlabel("feat1")
plt.ylabel("y");
Can simple linear regression fit this data?
Let’s try it out.
Right now we are focusing on getting a better fit on the training data, so we are skipping data splitting for demonstration purposes.
Also, for simplicity, we are using LinearRegression, which doesn't have any hyperparameter controlling the complexity of the model.
# Fit a regression line.
lr = LinearRegression()
lr.fit(X_toy, y_toy)
LinearRegression()
figure(figsize=(6, 4), dpi=80)
plt.scatter(X_toy[:, 0], y, s=50, edgecolors=(0, 0, 0))
preds = lr.predict(X_toy)
plt.xlabel("feat1")
plt.ylabel("y")
plt.plot(X_toy, preds, color="red", linewidth=2);
As expected, the regression line is unable to capture the shape of the data.
The model is underfit.
The score on the training data is close to a dummy model.
lr.score(X_toy, y_toy)
0.0002572570295679144
DummyRegressor().fit(X_toy, y_toy).score(X_toy, y_toy)
0.0
Adding quadratic features#
It looks like a quadratic function would be more suitable for this dataset.
A linear model on its own cannot fit a quadratic function. But what if we augment the data with a new quadratic feature?
Let’s try it out.
# add a squared feature
X_sq = np.hstack([X_toy, X_toy**2])
pd.DataFrame(X_sq, columns=["feat1", "feat1^2"])
feat1 | feat1^2 | |
---|---|---|
0 | -3.000000 | 9.000000 |
1 | -2.684211 | 7.204986 |
2 | -2.368421 | 5.609418 |
3 | -2.052632 | 4.213296 |
4 | -1.736842 | 3.016620 |
5 | -1.421053 | 2.019391 |
6 | -1.105263 | 1.221607 |
7 | -0.789474 | 0.623269 |
8 | -0.473684 | 0.224377 |
9 | -0.157895 | 0.024931 |
10 | 0.157895 | 0.024931 |
11 | 0.473684 | 0.224377 |
12 | 0.789474 | 0.623269 |
13 | 1.105263 | 1.221607 |
14 | 1.421053 | 2.019391 |
15 | 1.736842 | 3.016620 |
16 | 2.052632 | 4.213296 |
17 | 2.368421 | 5.609418 |
18 | 2.684211 | 7.204986 |
19 | 3.000000 | 9.000000 |
Let's plot our augmented data along with y.
plot_3d_reg(X_sq, y) # user-defined function from plotting_functions.py
plot_3d_reg(X_sq, y, surface=True)
A linear model fits well on this augmented data now!!
This is a common way to make linear models more flexible by adding more degrees of freedom.
The model is still linear, i.e., it’s still learning the coefficients for each feature. But the feature space is augmented now.
lr = LinearRegression()
lr.fit(X_sq, y_toy) # Linear regression with augmented data
lr.score(X_sq, y_toy) # The scores are much better now
0.9270602202765702
pd.DataFrame(
lr.coef_.transpose(), index=["feat1", "feat1^2"], columns=["Feature coefficient"]
)
Feature coefficient | |
---|---|
feat1 | -0.027196 |
feat1^2 | 1.006051 |
According to the model, our newly created feat1^2 feature is the most important feature for prediction; the coefficient of the squared feature has the biggest magnitude.
The idea of transforming features and creating new features is referred to as change of basis.
Polynomial regression in sklearn#
In sklearn we can add polynomial features using sklearn's PolynomialFeatures. Using this we can generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to a specified degree.
PolynomialFeatures is a transformer.
For a one-dimensional feature vector [a], the augmented features with a degree 2 polynomial would be: \([1, a, a^2]\).
For a two-dimensional feature vector [a, b], the augmented features with a degree 2 polynomial would be: \([1, a, b, a^2, ab, b^2]\).
Let’s try polynomial features with degree 2 and visualize augmented features.
from sklearn.preprocessing import PolynomialFeatures
deg = 2
poly_feats = PolynomialFeatures(degree=deg)
X_enc = poly_feats.fit_transform(X_toy)
pd.DataFrame(X_enc, columns=poly_feats.get_feature_names_out()).head()
1 | x0 | x0^2 | |
---|---|---|---|
0 | 1.0 | -3.000000 | 9.000000 |
1 | 1.0 | -2.684211 | 7.204986 |
2 | 1.0 | -2.368421 | 5.609418 |
3 | 1.0 | -2.052632 | 4.213296 |
4 | 1.0 | -1.736842 | 3.016620 |
Let’s fit linear regression on the transformed data.
lr_poly = LinearRegression()
lr_poly.fit(X_enc, y_toy)
preds = lr_poly.predict(X_enc)
lr_poly.score(X_enc, y_toy)
0.9270602202765702
The model is not underfit anymore. The training score is pretty good!
Let’s examine the coefficients.
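A sketch of examining them, mirroring the coefficient table we built earlier (the transformed features here are 1, x0, and x0^2):
pd.DataFrame(
    lr_poly.coef_.transpose(),
    index=poly_feats.get_feature_names_out(),
    columns=["Feature coefficient"],
)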
We can also plot the prediction from the augmented 3-D space in the original 2-D space.
figure(figsize=(6, 4), dpi=80)
plt.scatter(X_toy[:, 0], y_toy, s=50, edgecolors=(0, 0, 0))
plt.plot(X, preds, color="green", linewidth=1);
Now the fit is much better compared to linear regression on the original data!
What’s happening here?
The actual linear regression model is fit in the augmented space and making the prediction in that augmented space.
Let’s try degree=20 polynomial.
deg = 20
poly_feats = PolynomialFeatures(degree=deg)
X_enc = poly_feats.fit_transform(X_toy)
pd.DataFrame(X_enc, columns=poly_feats.get_feature_names_out()).head(10)
1 | x0 | x0^2 | x0^3 | x0^4 | x0^5 | x0^6 | x0^7 | x0^8 | x0^9 | ... | x0^11 | x0^12 | x0^13 | x0^14 | x0^15 | x0^16 | x0^17 | x0^18 | x0^19 | x0^20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | -3.000000 | 9.000000 | -27.000000 | 81.000000 | -243.000000 | 729.000000 | -2187.000000 | 6.561000e+03 | -1.968300e+04 | ... | -1.771470e+05 | 5.314410e+05 | -1.594323e+06 | 4.782969e+06 | -1.434891e+07 | 4.304672e+07 | -1.291402e+08 | 3.874205e+08 | -1.162261e+09 | 3.486784e+09 |
1 | 1.0 | -2.684211 | 7.204986 | -19.339700 | 51.911825 | -139.342268 | 374.023983 | -1003.959113 | 2.694838e+03 | -7.233512e+03 | ... | -5.211735e+04 | 1.398939e+05 | -3.755048e+05 | 1.007934e+06 | -2.705507e+06 | 7.262150e+06 | -1.949314e+07 | 5.232369e+07 | -1.404478e+08 | 3.769915e+08 |
2 | 1.0 | -2.368421 | 5.609418 | -13.285464 | 31.465573 | -74.523727 | 176.503563 | -418.034755 | 9.900823e+02 | -2.344932e+03 | ... | -1.315370e+04 | 3.115351e+04 | -7.378462e+04 | 1.747531e+05 | -4.138888e+05 | 9.802630e+05 | -2.321675e+06 | 5.498705e+06 | -1.302325e+07 | 3.084454e+07 |
3 | 1.0 | -2.052632 | 4.213296 | -8.648345 | 17.751867 | -36.438042 | 74.793875 | -153.524271 | 3.151288e+02 | -6.468433e+02 | ... | -2.725342e+03 | 5.594124e+03 | -1.148268e+04 | 2.356970e+04 | -4.837991e+04 | 9.930614e+04 | -2.038389e+05 | 4.184062e+05 | -8.588338e+05 | 1.762869e+06 |
4 | 1.0 | -1.736842 | 3.016620 | -5.239393 | 9.099999 | -15.805262 | 27.451244 | -47.678477 | 8.280999e+01 | -1.438279e+02 | ... | -4.338741e+02 | 7.535708e+02 | -1.308834e+03 | 2.273237e+03 | -3.948254e+03 | 6.857494e+03 | -1.191038e+04 | 2.068646e+04 | -3.592911e+04 | 6.240319e+04 |
5 | 1.0 | -1.421053 | 2.019391 | -2.869660 | 4.077938 | -5.794965 | 8.234950 | -11.702298 | 1.662958e+01 | -2.363151e+01 | ... | -4.772125e+01 | 6.781441e+01 | -9.636784e+01 | 1.369438e+02 | -1.946043e+02 | 2.765430e+02 | -3.929821e+02 | 5.584483e+02 | -7.935844e+02 | 1.127725e+03 |
6 | 1.0 | -1.105263 | 1.221607 | -1.350197 | 1.492323 | -1.649409 | 1.823031 | -2.014930 | 2.227027e+00 | -2.461451e+00 | ... | -3.006925e+00 | 3.323444e+00 | -3.673280e+00 | 4.059941e+00 | -4.487303e+00 | 4.959651e+00 | -5.481719e+00 | 6.058742e+00 | -6.696505e+00 | 7.401400e+00 |
7 | 1.0 | -0.789474 | 0.623269 | -0.492054 | 0.388464 | -0.306682 | 0.242117 | -0.191145 | 1.509042e-01 | -1.191349e-01 | ... | -7.425304e-02 | 5.862082e-02 | -4.627960e-02 | 3.653652e-02 | -2.884462e-02 | 2.277207e-02 | -1.797795e-02 | 1.419312e-02 | -1.120509e-02 | 8.846127e-03 |
8 | 1.0 | -0.473684 | 0.224377 | -0.106284 | 0.050345 | -0.023848 | 0.011296 | -0.005351 | 2.534611e-03 | -1.200605e-03 | ... | -2.693878e-04 | 1.276048e-04 | -6.044436e-05 | 2.863154e-05 | -1.356231e-05 | 6.424252e-06 | -3.043067e-06 | 1.441453e-06 | -6.827933e-07 | 3.234284e-07 |
9 | 1.0 | -0.157895 | 0.024931 | -0.003936 | 0.000622 | -0.000098 | 0.000015 | -0.000002 | 3.863147e-07 | -6.099706e-08 | ... | -1.520702e-09 | 2.401109e-10 | -3.791224e-11 | 5.986144e-12 | -9.451806e-13 | 1.492390e-13 | -2.356406e-14 | 3.720641e-15 | -5.874696e-16 | 9.275836e-17 |
10 rows × 21 columns
lr_poly = LinearRegression()
lr_poly.fit(X_enc, y)
preds = lr_poly.predict(X_enc)
figure(figsize=(6, 4), dpi=80)
plt.scatter(X_toy[:, 0], y_toy, s=50, edgecolors=(0, 0, 0))
plt.plot(X_toy, preds, color="green", linewidth=1);
It seems like we are overfitting now.
The model is trying to go through every training point.
Such a model is likely to generalize poorly to unseen data.
You can pick the degree of the polynomial using hyperparameter optimization, as sketched below.
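A minimal sketch of such a search on the toy data, using GridSearchCV over the degree of PolynomialFeatures (the grid of degrees here is arbitrary):
from sklearn.model_selection import GridSearchCV

pipe_poly_search = make_pipeline(PolynomialFeatures(), LinearRegression())
param_grid = {"polynomialfeatures__degree": range(1, 11)}
grid_search = GridSearchCV(pipe_poly_search, param_grid, cv=5, return_train_score=True)
grid_search.fit(X_toy, y_toy)
grid_search.best_params_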
Interim summary#
We can make linear models more flexible by augmenting the feature space.
One way to do it is by applying polynomial transformations.
Example: Suppose \(X\) has only one feature, say \(f_1\).
We can add a new feature \(f_1^2\).
With a degree 2 polynomial, our \(Z\) will have three features: the dummy feature \(f_0\) (always 1), \(f_1\), and \(f_1^2\). \[Z = \begin{bmatrix}1 & 0.86 & 0.74\\ 1 & 0.02 & 0.0004\\ 1 & -0.42 & 0.18\\ \end{bmatrix}\]
\(Z\) \(\rightarrow\) augmented dataset with quadratic features
fit: We fit using \(Z\) and learn the weights \(v\).
predict: When we predict on a test example, we apply the same transformation to the test example to get its augmented features, and then predict using the learned weights \(v\).
\(\hat{y}\) is still a linear function of \(v\) and \(Z\).
PolynomialFeatures with sklearn pipelines#
degree = 20
pipe_poly = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
pipe_poly.fit(X_toy, y_toy)
preds = pipe_poly.predict(X_toy)
figure(figsize=(6, 4), dpi=80)
plt.scatter(X_toy[:, 0], y_toy, s=50, edgecolors=(0, 0, 0))
plt.plot(X, preds, color="green", linewidth=1);
The model has learned coefficients for the transformed features.
pd.DataFrame(
pipe_poly.named_steps["linearregression"].coef_.transpose(),
index=pipe_poly.named_steps["polynomialfeatures"].get_feature_names_out(),
columns=["Feature coefficients"],
).sort_values("Feature coefficients", ascending=False)
Feature coefficients | |
---|---|
x0^7 | 84.271533 |
x0^11 | 31.398818 |
x0^6 | 29.056499 |
x0^12 | 16.435514 |
x0^2 | 9.626527 |
x0^8 | 8.711761 |
x0^3 | 3.763840 |
x0 | 1.856515 |
x0^15 | 1.138977 |
x0^16 | 0.755986 |
x0^19 | 0.002640 |
x0^20 | 0.001922 |
1 | -0.000220 |
x0^18 | -0.060286 |
x0^17 | -0.085945 |
x0^14 | -4.832276 |
x0^13 | -7.941176 |
x0^10 | -26.869921 |
x0^4 | -31.828364 |
x0^5 | -44.211552 |
x0^9 | -70.499855 |
fig, axes = plt.subplots(2, 5, figsize=(20, 8))
degrees = np.arange(1, 20, 2)
for deg, ax in zip(degrees, axes.ravel()):
pipe_poly = make_pipeline(PolynomialFeatures(degree=deg), LinearRegression())
pipe_poly.fit(X_toy, y_toy)
preds = pipe_poly.predict(X_toy)
ax.scatter(X_toy[:, 0], y_toy, s=50, edgecolors=(0, 0, 0))
ax.plot(X_toy, preds, color="green", linewidth=1.5)
title = "degree={}".format(deg)
ax.set_title(title)
Classification setting: Non-linearly separable data#
Let’s consider this non-linearly separable 1-D data.
# Consider this one-dimensional classification dataset
n = 20
d = 1
np.random.seed(10)
X = np.random.randn(n, d)
y = np.sum(X**2, axis=1) < 0.4
figure(figsize=(6, 4), dpi=80)
# plt.scatter(X[:, 0], np.zeros_like(X), c=y, s=50, edgecolors=(0, 0, 0));
mglearn.discrete_scatter(X[:, 0], np.zeros_like(X), y)
plt.xlabel("Feature0")
# plt.legend();
Text(0.5, 0, 'Feature0')
Can we use a linear classifier on this dataset?
linear_svm = SVC(kernel="linear", C=100)
linear_svm.fit(X, y)
print("Training accuracy", linear_svm.score(X, y))
Training accuracy 0.75
What if we augment this data with a degree 2 polynomial feature?
X[:5]
array([[ 1.3315865 ],
[ 0.71527897],
[-1.54540029],
[-0.00838385],
[ 0.62133597]])
poly = PolynomialFeatures(
2, include_bias=False
) # Excluding the bias term for simplicity
X_transformed = poly.fit_transform(X)
X_transformed[0:5]
array([[ 1.33158650e+00, 1.77312262e+00],
[ 7.15278974e-01, 5.11624011e-01],
[-1.54540029e+00, 2.38826206e+00],
[-8.38384993e-03, 7.02889396e-05],
[ 6.21335974e-01, 3.86058392e-01]])
linear_svm = SVC(kernel="linear", C=100)
linear_svm.fit(X_transformed, y)
print("Training accuracy", linear_svm.score(X_transformed, y))
plot_orig_transformed_svc(linear_svm, X, X_transformed, y)
Training accuracy 1.0
The data is linearly separable in this new feature space!!
(Optional) Another example with two features#
import mglearn
from sklearn.datasets import make_blobs
X, y = make_blobs(centers=4, random_state=8)
y = y % 2
figure(figsize=(6, 4), dpi=80)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1");
from sklearn.svm import LinearSVC
linear_svm = LinearSVC().fit(X, y)
figure(figsize=(6, 4), dpi=80)
mglearn.plots.plot_2d_separator(linear_svm, X)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1");
/Users/kvarada/opt/miniconda3/envs/573/lib/python3.10/site-packages/sklearn/svm/_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
warnings.warn(
# add square of the second feature
X_new = np.hstack([X, X[:, 1:] ** 2])
plot_mglearn_3d(X_new, y);
linear_svm_3d = LinearSVC().fit(X_new, y)
XX, YY = plot_svc_3d_decision_boundary(X_new, y, linear_svm_3d)
/Users/kvarada/opt/miniconda3/envs/573/lib/python3.10/site-packages/sklearn/svm/_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
warnings.warn(
What does this linear boundary in \(Z\)-space correspond to in the original (\(X\)) space?
figure(figsize=(6, 4), dpi=80)
plot_Z_space_boundary_in_X_space(linear_svm_3d, X, y, XX, YY)
It’s a parabola!
Another example with non-linearly separable data
from sklearn import datasets
figure(figsize=(6, 4), dpi=80)
X, y = datasets.make_circles(n_samples=200, noise=0.06, factor=0.4)
plt.scatter(X[:, 0], X[:, 1], s=50, c=y, cmap=plt.cm.Paired, edgecolors=(0, 0, 0));
lr_circ = LinearRegression()
lr_circ.fit(X, y).score(X, y)
0.00015933776258525434
lr_circ_pipe = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
lr_circ_pipe.fit(X, y).score(X, y)
0.9669328201796121
Feature interactions and feature crosses#
A feature cross is a synthetic feature formed by multiplying or crossing two or more features.
Example: Is the following dataset (XOR function) linearly separable?
| \(x_1\) | \(x_2\) | target |
|---|---|---|
| 1 | 1 | 0 |
| -1 | 1 | 1 |
| 1 | -1 | 1 |
| -1 | -1 | 0 |
For XOR-like problems, if we create a feature cross \(x_1x_2\), the data becomes linearly separable.
| \(x_1\) | \(x_2\) | \(x_1x_2\) | target |
|---|---|---|---|
| 1 | 1 | 1 | 0 |
| -1 | 1 | -1 | 1 |
| 1 | -1 | -1 | 1 |
| -1 | -1 | 1 | 0 |
Let’s look at an example with more data points.
xx, yy = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
rng = np.random.RandomState(0)
X_xor = rng.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0)
figure(figsize=(6, 4), dpi=80)
plt.scatter(
X_xor[:, 0], X_xor[:, 1], s=30, c=y_xor, cmap=plt.cm.Paired, edgecolors=(0, 0, 0)
);
LogisticRegression().fit(X_xor, y_xor).score(X_xor, y_xor)
0.535
pipe_xor = make_pipeline(
PolynomialFeatures(interaction_only=True), LogisticRegression()
)
pipe_xor.fit(X_xor, y_xor)
pipe_xor.score(X_xor, y_xor)
0.995
feature_names = (
pipe_xor.named_steps["polynomialfeatures"].get_feature_names_out().tolist()
)
pd.DataFrame(
pipe_xor.named_steps["logisticregression"].coef_.transpose(),
index=feature_names,
columns=["Feature coefficient"],
)
Feature coefficient | |
---|---|
1 | -0.001828 |
x0 | -0.028418 |
x1 | 0.130472 |
x0 x1 | -5.085936 |
The interaction feature has the biggest coefficient!
Feature crosses for one-hot encoded features#
You can think of feature crosses of one-hot encoded features as logical conjunctions.
Suppose you want to predict whether you will find parking or not based on two features:
area (possible categories: UBC campus and Rogers Arena)
time of the day (possible categories: 9am and 7pm)
A feature cross in this case would create four new features:
UBC campus and 9am
UBC campus and 7pm
Rogers Arena and 9am
Rogers Arena and 7pm.
The features UBC campus and 9am on their own are not that informative, but the newly created crossed features, such as UBC campus and 9am or Rogers Arena and 7pm, would be quite informative.
Coming up with the right combination of features requires some domain knowledge or careful examination of the data.
There is no dedicated support for feature crosses in sklearn, but one rough workaround is sketched below.
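A rough sketch of that workaround: compose OneHotEncoder with PolynomialFeatures(interaction_only=True) so that products of the one-hot columns act as crosses. The toy parking data below is made up for illustration.
# Hypothetical toy data for the parking example above
toy_parking = pd.DataFrame(
    {"area": ["UBC campus", "Rogers Arena", "UBC campus"], "time": ["9am", "7pm", "7pm"]}
)
ohe = OneHotEncoder(sparse=False)  # on newer sklearn versions use sparse_output=False
crosses = PolynomialFeatures(interaction_only=True, include_bias=False)
X_crossed = crosses.fit_transform(ohe.fit_transform(toy_parking))
crosses.get_feature_names_out(ohe.get_feature_names_out())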
Questions to consider#
How do we know what degree of polynomial to use?
Can we plot the data and see what it looks like so that we can pick a polynomial with the appropriate degree?
Plotting cannot take us much further
Not possible to visualize high dimensional data
Can we consider one feature at a time?
Hopeless when features interact with each other
You may draw misleading conclusions when looking at only one feature at a time.
Hyperparameter optimization
Can be potentially very slow.
Problems with polynomial basis#
Let \(d\) be the original number of features and \(p\) be the degree of polynomial.
In general, we have roughly \(\mathcal{O}(d^p)\) feature combinations.
For example, for \(d = 1000\), and \(p = 3\), we would have around a billion new feature combinations!
This is problematic!
How can we do this efficiently?#
Kernel trick
Computationally efficient approach to map features
Calculate these relationships in higher dimensional space without actually carrying out the transformation.
Overall, saying something is a "kernel method" corresponds to this idea of implicitly calculating relationships in the data in a higher dimensional space.
The different transformations then have different names, like "polynomial kernel" or "RBF kernel".
For details see this video from CPSC 340.
More on the RBF kernel later; a small example with a kernelized SVC is sketched below.
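As a quick illustration (reusing the circles data X and y created above), a degree-2 polynomial kernel lets SVC handle the circles without explicitly constructing the squared features:
poly_kernel_svm = SVC(kernel="poly", degree=2, C=100)
poly_kernel_svm.fit(X, y)
poly_kernel_svm.score(X, y)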
(Optional) Recall RBF Kernel#
Hard to visualize but you can think of this as a weighted nearest-neighbour.
During prediction, the closest examples have a lot of influence on how we classify the new example compared to the ones further away.
In general, for both regression/classification, you can think of RBF kernel as “smooth KNN”.
At test time, each training example gets to "vote" on the label of the test point, and the weight of a training example's vote decreases with its distance from the test point.
RBFs#
What is a radial basis function (RBF)?
A set of non-parametric bases that depend on distances to training points.
Non-parametric because size of basis (number of features) grows with \(n\).
Model gets more complicated as you get more data.
Example: RBFs#
Similar to polynomial basis, we transform \(X\) to \(Z\).
Consider \(X_{train}\) with three examples \(x_1\), \(x_2\), and \(x_3\) and two features, and \(X_{test}\) with two examples: \(\tilde{x_1}\) and \(\tilde{x_2}\).
We create \(n\) features.
Gaussian Radial Basis Functions (Gaussian RBFs)#
Most common \(g\) is the Gaussian RBF: \[g(\varepsilon)=\exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right)\]
So in our case: \[g(x_i - x_j)=\exp\left(-\frac{\lVert x_i - x_j\rVert^2}{2\sigma^2}\right)\]
\(\sigma\) is a hyperparameter that controls the width of the bumps.
We can fit least squares with different \(\sigma\) values
Gaussian RBFs (non-parametric basis)#
How many bumps should we use?
We use \(n\) bumps (non-parametric basis)
Where should the bumps be centered?
Each bump is centered on one training example \(x_i\).
How high should the bumps go?
Fitting regression weights \(w\) gives us the heights (and signs).
How wide should the bumps be?
The width is a hyper-parameter (narrow bumps = complicated model)
Enough bumps can approximate any continuous function to arbitrary precision.
But with \(n\) data points RBFs have \(n\) features
How do we avoid overfitting with this huge number of features?
We regularize \(w\) (coming up in two weeks) and use validation error to choose \(\sigma\) and \(\lambda\).
Interpretation of gamma in SVM RBF#
gamma controls the complexity (fundamental trade-off):
larger gamma \(\rightarrow\) more complex
smaller gamma \(\rightarrow\) less complex
The gamma hyperparameter in SVC is related to \(\sigma\) in the Gaussian RBF (roughly \(\gamma = \frac{1}{2\sigma^2}\)).
You can think of gamma as the inverse width of the "bumps".
Larger gamma means narrower peaks, and narrower peaks mean a more complex model, as illustrated in the sketch below.
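A quick sketch (again assuming the circles data X and y is still in scope): as gamma grows, the model gets more complex and the training accuracy typically goes up.
for gamma in [0.01, 0.1, 1, 10, 100]:
    rbf_svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma}: train accuracy={rbf_svm.score(X, y):.3f}")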
Constructing Gaussian RBF with \(X\) and \(\sigma\)#
# Construct the n x n Gaussian RBF basis Z from training data X (n examples) and width sigma
Z = np.zeros((n, n))
for i1 in range(n):
    for i2 in range(n):
        Z[i1, i2] = np.exp(-np.linalg.norm(X[i1] - X[i2]) ** 2 / (2 * sigma**2))
Gaussian RBFs: Prediction#
Given a test example \(\tilde{x_i}\):
\[\hat{y_i} = w_1 \exp\left(\frac{-\lVert \tilde{x_i} - x_1\rVert^2}{2\sigma^2}\right) + w_2 \exp\left(\frac{-\lVert \tilde{x_i} - x_2\rVert^2}{2\sigma^2}\right) + \dots + w_n \exp\left(\frac{-\lVert \tilde{x_i} - x_n\rVert^2}{2\sigma^2}\right) = \sum_{j = 1}^n w_j \exp\left(\frac{-\lVert \tilde{x_i} - x_j\rVert^2}{2\sigma^2}\right)\]
This is expensive at test time: it needs the distance to all training examples.
RBF with regularization and optimized \(\sigma\) and \(\lambda\)#
A model that is hard to beat:
RBF basis with L2-regularization and cross-validation to choose \(\sigma\) and \(\lambda\).
Flexible non-parametric basis, magic of regularization, and tuning for test error
For each value of \(\lambda\) and \(\sigma\) (a code sketch of this recipe follows below):
Compute \(Z\) on training data
Compute the best weights \(v\) using least squares
Compute \(\tilde{Z}\) on validation set (using train set distances)
Make predictions \(\hat{y} = \tilde{Z}v\)
Compute validation error
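A minimal code sketch of this recipe; the tiny synthetic dataset and the grids of \(\sigma\) and \(\lambda\) values below are made up for illustration.
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import euclidean_distances

# Tiny synthetic 1-D regression data, made up just for this sketch
rng_rbf = np.random.RandomState(42)
X_rbf = np.sort(rng_rbf.uniform(-3, 3, (40, 1)), axis=0)
y_rbf = np.sin(X_rbf[:, 0]) + 0.1 * rng_rbf.randn(40)
X_tr, X_va, y_tr, y_va = train_test_split(X_rbf, y_rbf, random_state=123)

def gaussian_rbf_basis(X_query, X_train, sigma):
    # Z[i, j] = exp(-||X_query[i] - X_train[j]||^2 / (2 sigma^2))
    return np.exp(-euclidean_distances(X_query, X_train, squared=True) / (2 * sigma**2))

for sigma in [0.3, 1.0, 3.0]:
    for lam in [0.01, 1.0, 100.0]:
        Z_tr = gaussian_rbf_basis(X_tr, X_tr, sigma)  # n x n basis on the training set
        Z_va = gaussian_rbf_basis(X_va, X_tr, sigma)  # distances to the *training* examples
        rbf_ridge = Ridge(alpha=lam).fit(Z_tr, y_tr)
        print(f"sigma={sigma}, lambda={lam}: validation R^2={rbf_ridge.score(Z_va, y_va):.3f}")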
Using RBF with least squares: KernelRidge (optional)#
sklearn.kernel_ridge.KernelRidge(alpha=1, kernel='linear', gamma=None, degree=3, coef0=1, kernel_params=None)
Kernel ridge regression. Kernel ridge regression (KRR) combines ridge regression (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space.
The form of the model learned by KRR is identical to support vector regression (SVR). However, different loss functions are used: KRR uses squared error loss while support vector regression uses epsilon-insensitive loss, both combined with l2 regularization. In contrast to SVR, fitting a KRR model can be done in closed-form and is typically faster for medium-sized datasets. On the other hand, the learned model is non-sparse and thus slower than SVR, which learns a sparse model for epsilon > 0, at prediction-time.
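For instance, a minimal sketch on the toy data from earlier (assuming X_toy and y_toy are still defined; the hyperparameter values here are arbitrary):
from sklearn.kernel_ridge import KernelRidge

krr = KernelRidge(alpha=1.0, kernel="rbf", gamma=0.1)
krr.fit(X_toy, y_toy)
krr.score(X_toy, y_toy)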
❓❓ Questions for you#
iClicker Exercise 3.2#
iClicker cloud join link: https://join.iclicker.com/C0P55
Select all of the following statements which are TRUE.
(A) Suppose we add quadratic features to dataset \(X\) and the augmented dataset is \(Z\). Fitting linear regression on \(Z\) would learn a linear function of \(Z\).
(B) The least squares loss function shown below is independent of the bias term (the \(y\) intercept).
(C) If you get the same validation error with polynomials of degrees \(d\) and \(d+4\), it is better to pick the polynomial of degree \(d\).
(D) If you are given a large dataset with 1000 features, it’s a good idea to start simple and work with one or two features in order to verify your intuitions.
(E) Suppose you apply polynomial transformations with degree 3 polynomial during training. During prediction time on the test set, you must calculate degree three polynomial features of the given feature vector in order to get predictions.
V’s answers: (A), (C), (D), (E)
Break (5 min)#
Demo of feature engineering with numeric features#
Remember the California housing dataset we used in DSCI 571?
The prediction task is predicting median_house_value for a given property.
housing_df = pd.read_csv("data/california_housing.csv")
housing_df.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
housing_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
Suppose we decide to train a Ridge model on this dataset.
What would happen if we train a model without applying any transformation on the categorical feature ocean_proximity?
Error!! A linear model requires all features to be in numeric form.
What would happen if we apply OHE on ocean_proximity but do not scale the features?
No syntax error, but the model results are likely to be poor.
Do we need to apply any other transformations on this data?
In this section, we will look into some common ways to do feature engineering for numeric or categorical features.
train_df, test_df = train_test_split(housing_df, test_size=0.2, random_state=123)
We have the total rooms and the number of households in the neighbourhood. How about creating a rooms_per_household feature using this information?
train_df = train_df.assign(
rooms_per_household=train_df["total_rooms"] / train_df["households"]
)
test_df = test_df.assign(
rooms_per_household=test_df["total_rooms"] / test_df["households"]
)
train_df
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | rooms_per_household | |
---|---|---|---|---|---|---|---|---|---|---|---|
9950 | -122.33 | 38.38 | 28.0 | 1020.0 | 169.0 | 504.0 | 164.0 | 4.5694 | 287500.0 | INLAND | 6.219512 |
3547 | -118.60 | 34.26 | 18.0 | 6154.0 | 1070.0 | 3010.0 | 1034.0 | 5.6392 | 271500.0 | <1H OCEAN | 5.951644 |
4448 | -118.21 | 34.07 | 47.0 | 1346.0 | 383.0 | 1452.0 | 371.0 | 1.7292 | 191700.0 | <1H OCEAN | 3.628032 |
6984 | -118.02 | 33.96 | 36.0 | 2071.0 | 398.0 | 988.0 | 404.0 | 4.6226 | 219700.0 | <1H OCEAN | 5.126238 |
4432 | -118.20 | 34.08 | 49.0 | 1320.0 | 309.0 | 1405.0 | 328.0 | 2.4375 | 114000.0 | <1H OCEAN | 4.024390 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7763 | -118.10 | 33.91 | 36.0 | 726.0 | NaN | 490.0 | 130.0 | 3.6389 | 167600.0 | <1H OCEAN | 5.584615 |
15377 | -117.24 | 33.37 | 14.0 | 4687.0 | 793.0 | 2436.0 | 779.0 | 4.5391 | 180900.0 | <1H OCEAN | 6.016688 |
17730 | -121.76 | 37.33 | 5.0 | 4153.0 | 719.0 | 2435.0 | 697.0 | 5.6306 | 286200.0 | <1H OCEAN | 5.958393 |
15725 | -122.44 | 37.78 | 44.0 | 1545.0 | 334.0 | 561.0 | 326.0 | 3.8750 | 412500.0 | NEAR BAY | 4.739264 |
19966 | -119.08 | 36.21 | 20.0 | 1911.0 | 389.0 | 1241.0 | 348.0 | 2.5156 | 59300.0 | INLAND | 5.491379 |
16512 rows × 11 columns
Let's start simple. Imagine that we have only three features: longitude, latitude, and our newly created rooms_per_household feature.
X_train_housing = train_df[["latitude", "longitude", "rooms_per_household"]]
y_train_housing = train_df["median_house_value"]
from sklearn.compose import make_column_transformer
numeric_feats = ["latitude", "longitude", "rooms_per_household"]
preprocessor1 = make_column_transformer(
(make_pipeline(SimpleImputer(), StandardScaler()), numeric_feats)
)
lr_1 = make_pipeline(preprocessor1, Ridge())
pd.DataFrame(
cross_validate(lr_1, X_train_housing, y_train_housing, return_train_score=True)
)
fit_time | score_time | test_score | train_score | |
---|---|---|---|---|
0 | 0.005102 | 0.001189 | 0.280028 | 0.311769 |
1 | 0.002480 | 0.000848 | 0.325319 | 0.300464 |
2 | 0.002208 | 0.000880 | 0.317277 | 0.301952 |
3 | 0.002187 | 0.000890 | 0.316798 | 0.303004 |
4 | 0.002228 | 0.000933 | 0.260258 | 0.314840 |
The scores are not great.
Let’s look at the distribution of the longitude and latitude features.
figure(figsize=(6, 4), dpi=80)
plt.hist(train_df["longitude"], bins=50)
plt.title("Distribution of latitude feature");
figure(figsize=(6, 4), dpi=80)
plt.hist(train_df["latitude"], bins=50)
plt.title("Distribution of latitude feature");
Suppose you are planning to build a linear model for housing price prediction.
If we think longitude is a good feature for prediction, does it make sense to use the floating point representation of this feature that's given to us?
Remember that linear models can capture only linear relationships.
How about discretizing latitude and longitude features and putting them into buckets?
This process of transforming numeric features into categorical features is called bucketing or binning.
In sklearn you can do this using the KBinsDiscretizer transformer.
Let's examine whether we get better results with binning.
from sklearn.preprocessing import KBinsDiscretizer
discretization_feats = ["latitude", "longitude"]
numeric_feats = ["rooms_per_household"]
preprocessor2 = make_column_transformer(
(KBinsDiscretizer(n_bins=20, encode="onehot"), discretization_feats),
(make_pipeline(SimpleImputer(), StandardScaler()), numeric_feats),
)
lr_2 = make_pipeline(preprocessor2, Ridge())
pd.DataFrame(
cross_validate(lr_2, X_train_housing, y_train_housing, return_train_score=True)
)
fit_time | score_time | test_score | train_score | |
---|---|---|---|---|
0 | 0.015637 | 0.002666 | 0.441442 | 0.456418 |
1 | 0.010789 | 0.002345 | 0.469554 | 0.446215 |
2 | 0.010230 | 0.002478 | 0.479166 | 0.446868 |
3 | 0.009855 | 0.002336 | 0.450818 | 0.453366 |
4 | 0.009889 | 0.002189 | 0.388175 | 0.467627 |
The results are better with binned features. Let's examine what these binned features look like.
lr_2.fit(X_train_housing, y_train_housing)
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('kbinsdiscretizer', KBinsDiscretizer(n_bins=20), ['latitude', 'longitude']), ('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer()), ('standardscaler', StandardScaler())]), ['rooms_per_household'])])), ('ridge', Ridge())])
pd.DataFrame(
preprocessor2.fit_transform(X_train_housing).todense(),
columns=preprocessor2.get_feature_names_out(),
)
kbinsdiscretizer__latitude_0.0 | kbinsdiscretizer__latitude_1.0 | kbinsdiscretizer__latitude_2.0 | kbinsdiscretizer__latitude_3.0 | kbinsdiscretizer__latitude_4.0 | kbinsdiscretizer__latitude_5.0 | kbinsdiscretizer__latitude_6.0 | kbinsdiscretizer__latitude_7.0 | kbinsdiscretizer__latitude_8.0 | kbinsdiscretizer__latitude_9.0 | ... | kbinsdiscretizer__longitude_11.0 | kbinsdiscretizer__longitude_12.0 | kbinsdiscretizer__longitude_13.0 | kbinsdiscretizer__longitude_14.0 | kbinsdiscretizer__longitude_15.0 | kbinsdiscretizer__longitude_16.0 | kbinsdiscretizer__longitude_17.0 | kbinsdiscretizer__longitude_18.0 | kbinsdiscretizer__longitude_19.0 | pipeline__rooms_per_household | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.316164 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.209903 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.711852 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.117528 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.554621 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
16507 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.064307 |
16508 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.235706 |
16509 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.212581 |
16510 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.271037 |
16511 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.027321 |
16512 rows × 41 columns
How about discretizing all three features?
from sklearn.preprocessing import KBinsDiscretizer
discretization_feats = ["latitude", "longitude", "rooms_per_household"]
preprocessor3 = make_column_transformer(
(KBinsDiscretizer(n_bins=20, encode="onehot"), discretization_feats),
)
lr_3 = make_pipeline(preprocessor3, Ridge())
pd.DataFrame(
cross_validate(lr_3, X_train_housing, y_train_housing, return_train_score=True)
)
fit_time | score_time | test_score | train_score | |
---|---|---|---|---|
0 | 0.010873 | 0.002158 | 0.590610 | 0.571969 |
1 | 0.010465 | 0.002116 | 0.575886 | 0.570473 |
2 | 0.011467 | 0.002253 | 0.579108 | 0.573541 |
3 | 0.010105 | 0.001971 | 0.571495 | 0.574259 |
4 | 0.010333 | 0.002268 | 0.541501 | 0.581687 |
The results have improved further!!
Let’s examine the coefficients
lr_3.fit(X_train_housing, y_train_housing)
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('kbinsdiscretizer', KBinsDiscretizer(n_bins=20), ['latitude', 'longitude', 'rooms_per_household'])])), ('ridge', Ridge())])
feature_names = (
lr_3.named_steps["columntransformer"]
.named_transformers_["kbinsdiscretizer"]
.get_feature_names_out()
)
lr_3.named_steps["ridge"].coef_.shape
(60,)
coefs_df = pd.DataFrame(
lr_3.named_steps["ridge"].coef_.transpose(),
index=feature_names,
columns=["coefficient"],
).sort_values("coefficient", ascending=False)
print(coefs_df)
coefficient
longitude_1.0 211264.741526
latitude_1.0 205072.700578
latitude_0.0 201872.963204
longitude_0.0 190382.307281
longitude_2.0 160226.705422
longitude_3.0 157170.570627
latitude_2.0 154016.104655
rooms_per_household_19.0 138446.914059
latitude_8.0 135312.891136
longitude_4.0 132299.492558
latitude_7.0 124997.893455
latitude_3.0 118555.428435
longitude_5.0 116178.110119
rooms_per_household_18.0 102055.752443
longitude_6.0 96570.932606
latitude_4.0 92839.285885
latitude_6.0 90957.852350
latitude_9.0 71151.812511
rooms_per_household_17.0 70487.600832
latitude_5.0 69460.010604
longitude_10.0 52387.931534
rooms_per_household_16.0 44318.361075
rooms_per_household_15.0 31461.590554
longitude_7.0 25695.177065
latitude_10.0 20272.013164
rooms_per_household_14.0 16466.784005
rooms_per_household_13.0 9355.137662
longitude_8.0 6379.905443
rooms_per_household_12.0 1851.911387
rooms_per_household_11.0 -12173.744559
longitude_9.0 -14485.569105
rooms_per_household_10.0 -16630.179605
rooms_per_household_9.0 -19582.738886
longitude_11.0 -22826.368243
rooms_per_household_8.0 -26908.900187
rooms_per_household_7.0 -30572.917476
rooms_per_household_6.0 -32727.412466
rooms_per_household_4.0 -40703.331515
rooms_per_household_3.0 -42079.021099
rooms_per_household_5.0 -43463.598682
rooms_per_household_2.0 -47601.049038
rooms_per_household_0.0 -50857.281189
rooms_per_household_1.0 -51143.877316
latitude_13.0 -57455.320993
longitude_14.0 -70996.211314
longitude_13.0 -89279.126781
longitude_12.0 -90638.927965
latitude_11.0 -100372.149421
longitude_15.0 -105073.921333
latitude_12.0 -111471.133870
latitude_14.0 -114773.573743
latitude_15.0 -116412.142564
longitude_16.0 -119542.396969
latitude_16.0 -140126.669111
longitude_17.0 -174828.915100
latitude_18.0 -185916.174811
latitude_17.0 -195559.834276
longitude_18.0 -205153.346778
longitude_19.0 -255731.090593
latitude_19.0 -262421.957186
Does it make sense to take feature crosses in this context?
What information would they encode?
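A sketch of one way to explore this (not necessarily an improvement): cross the binned latitude and longitude features so that each (latitude bin, longitude bin) combination gets its own feature, which roughly encodes a grid cell on the map.
preprocessor4 = make_column_transformer(
    (
        make_pipeline(
            KBinsDiscretizer(n_bins=20, encode="onehot"),
            PolynomialFeatures(interaction_only=True, include_bias=False),
        ),
        ["latitude", "longitude"],
    ),
    (make_pipeline(SimpleImputer(), StandardScaler()), ["rooms_per_household"]),
)
lr_4 = make_pipeline(preprocessor4, Ridge())
pd.DataFrame(
    cross_validate(lr_4, X_train_housing, y_train_housing, return_train_score=True)
)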
Feature engineering for text data#
Feature engineering is very relevant in the context of "unstructured" data such as text data or image data.
We can extract important information using human knowledge and incorporate it into our models.
So it’s hard to talk about general methods for feature engineering.
In the lab you’ll be carrying out feature engineering on text data.
Let’s look at an example of feature engineering for text data.
We will be using the Covid tweets dataset for this.
df = pd.read_csv("data/Corona_NLP_test.csv")
df["Sentiment"].value_counts()
Negative 1041
Positive 947
Neutral 619
Extremely Positive 599
Extremely Negative 592
Name: Sentiment, dtype: int64
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
train_df
UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment | |
---|---|---|---|---|---|---|
1927 | 1928 | 46880 | Seattle, WA | 13-03-2020 | While I don't like all of Amazon's choices, to... | Positive |
1068 | 1069 | 46021 | NaN | 13-03-2020 | Me: shit buckets, its time to do the weekly s... | Negative |
803 | 804 | 45756 | The Outer Limits | 12-03-2020 | @SecPompeo @realDonaldTrump You mean the plan ... | Neutral |
2846 | 2847 | 47799 | Flagstaff, AZ | 15-03-2020 | @lauvagrande People who are sick arent panic ... | Extremely Negative |
3768 | 3769 | 48721 | Montreal, Canada | 16-03-2020 | Coronavirus Panic: Toilet Paper Is the People... | Negative |
... | ... | ... | ... | ... | ... | ... |
1122 | 1123 | 46075 | NaN | 13-03-2020 | Photos of our local grocery store shelveswher... | Extremely Positive |
1346 | 1347 | 46299 | Toronto | 13-03-2020 | Just went to the the grocery store (Highland F... | Positive |
3454 | 3455 | 48407 | Houston, TX | 16-03-2020 | Real talk though. Am I the only one spending h... | Neutral |
3437 | 3438 | 48390 | Washington, DC | 16-03-2020 | The supermarket business is booming! #COVID2019 | Neutral |
3582 | 3583 | 48535 | St James' Park, Newcastle | 16-03-2020 | Evening All Here s the story on the and the im... | Positive |
3038 rows × 6 columns
train_df.columns
Index(['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet',
'Sentiment'],
dtype='object')
train_df["Location"].value_counts()
United States 63
London, England 37
Los Angeles, CA 30
New York, NY 29
Washington, DC 29
..
Suburb of Chicago 1
philippines 1
Dont ask for freedom, take it. 1
Windsor Heights, IA 1
St James' Park, Newcastle 1
Name: Location, Length: 1441, dtype: int64
X_train, y_train = train_df[["OriginalTweet"]], train_df["Sentiment"]
X_test, y_test = test_df[["OriginalTweet"]], test_df["Sentiment"]
y_train.value_counts()
Negative 852
Positive 743
Neutral 501
Extremely Negative 472
Extremely Positive 470
Name: Sentiment, dtype: int64
scoring_metrics = "accuracy"
results = {}
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
"""
Returns mean and std of cross validation
Parameters
----------
model :
scikit-learn model
X_train : numpy array or pandas DataFrame
X in the training data
y_train :
y in the training data
Returns
----------
pandas Series with mean scores from cross_validation
"""
scores = cross_validate(model, X_train, y_train, **kwargs)
mean_scores = pd.DataFrame(scores).mean()
std_scores = pd.DataFrame(scores).std()
out_col = []
for i in range(len(mean_scores)):
out_col.append((f"%0.3f (+/- %0.3f)" % (mean_scores[i], std_scores[i])))
return pd.Series(data=out_col, index=mean_scores.index)
Dummy classifier#
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier()
results["dummy"] = mean_std_cross_val_scores(
dummy, X_train, y_train, return_train_score=True, scoring=scoring_metrics
)
pd.DataFrame(results).T
fit_time | score_time | test_score | train_score | |
---|---|---|---|---|
dummy | 0.001 (+/- 0.000) | 0.000 (+/- 0.000) | 0.280 (+/- 0.001) | 0.280 (+/- 0.000) |
Bag-of-words model#
from sklearn.feature_extraction.text import CountVectorizer
pipe = make_pipeline(
CountVectorizer(stop_words="english"), LogisticRegression(max_iter=1000)
)
results["logistic regression"] = mean_std_cross_val_scores(
pipe,
X_train["OriginalTweet"],
y_train,
return_train_score=True,
scoring=scoring_metrics,
)
pd.DataFrame(results).T
fit_time | score_time | test_score | train_score | |
---|---|---|---|---|
dummy | 0.001 (+/- 0.000) | 0.000 (+/- 0.000) | 0.280 (+/- 0.001) | 0.280 (+/- 0.000) |
logistic regression | 0.477 (+/- 0.024) | 0.007 (+/- 0.000) | 0.413 (+/- 0.011) | 0.999 (+/- 0.000) |
Is it possible to further improve the scores?#
How about adding new features based on our intuitions? Let’s extract our own features that might be useful for this prediction task. In other words, let’s carry out feature engineering.
The code below adds some very basic length-related and sentiment features. We will be using a popular library called nltk for this exercise. If you have successfully created the course conda environment on your machine, you should already have this package in the environment.
How do we extract interesting information from text?
We use pre-trained models!
A couple of popular libraries include such pre-trained models:
nltk: conda install -c anaconda nltk
spaCy: conda install -c conda-forge spacy
For emoji support: pip install spacymoji
You also need to download the spaCy language model which contains the pre-trained models. For that, run the following in your course conda environment.
import spacy
# !python -m spacy download en_core_web_md
import nltk
nltk.download("punkt")
[nltk_data] Downloading package punkt to /Users/kvarada/nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
nltk.download("vader_lexicon")
nltk.download("punkt")
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
[nltk_data] Downloading package vader_lexicon to
[nltk_data] /Users/kvarada/nltk_data...
[nltk_data] Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /Users/kvarada/nltk_data...
[nltk_data] Package punkt is already up-to-date!
s = "MDS students are smart, sweet, and funny."
print(sid.polarity_scores(s))
{'neg': 0.0, 'neu': 0.317, 'pos': 0.683, 'compound': 0.8225}
s = "MDS students are tired because of all the hard work they have been doing."
print(sid.polarity_scores(s))
{'neg': 0.264, 'neu': 0.736, 'pos': 0.0, 'compound': -0.5106}
spaCy#
A useful package for text processing and feature extraction
Active development: https://github.com/explosion/spaCy
Interactive lessons by Ines Montani: https://course.spacy.io/en/
Good documentation, easy to use, and customizable.
import en_core_web_md # pre-trained model
import spacy
nlp = en_core_web_md.load()
sample_text = """Dolly Parton is a gift to us all.
From writing all-time great songs like “Jolene” and “I Will Always Love You”,
to great performances in films like 9 to 5, to helping fund a COVID-19 vaccine,
she’s given us so much. Now, Netflix bring us Dolly Parton’s Christmas on the Square,
an original musical that stars Christine Baranski as a Scrooge-like landowner
who threatens to evict an entire town on Christmas Eve to make room for a new mall.
Directed and choreographed by the legendary Debbie Allen and counting Jennifer Lewis
and Parton herself amongst its cast, Christmas on the Square seems like the perfect movie
to save Christmas 2020. 😻 👍🏿"""
# [Adapted from here.](https://thepopbreak.com/2020/11/22/dolly-partons-christmas-on-the-square-review-not-quite-a-christmas-miracle/)
Spacy extracts all interesting information from text with this call.
doc = nlp(sample_text)
Let’s look at part-of-speech tags.
print([(token, token.pos_) for token in doc][:20])
[(Dolly, 'PROPN'), (Parton, 'PROPN'), (is, 'AUX'), (a, 'DET'), (gift, 'NOUN'), (to, 'ADP'), (us, 'PRON'), (all, 'PRON'), (., 'PUNCT'), (
, 'SPACE'), (From, 'ADP'), (writing, 'VERB'), (all, 'DET'), (-, 'PUNCT'), (time, 'NOUN'), (great, 'ADJ'), (songs, 'NOUN'), (like, 'ADP'), (“, 'PUNCT'), (Jolene, 'PROPN')]
Often we want to know who did what to whom.
Named entities give you this information.
What are named entities in the text?
print("Named entities:\n", [(ent.text, ent.label_) for ent in doc.ents])
print("\nORG means: ", spacy.explain("ORG"))
print("\nPERSON means: ", spacy.explain("PERSON"))
print("\nDATE means: ", spacy.explain("DATE"))
Named entities:
[('Dolly Parton', 'PERSON'), ('Jolene', 'WORK_OF_ART'), ('I Will Always Love You', 'WORK_OF_ART'), ('9', 'CARDINAL'), ('Netflix', 'ORG'), ('Dolly Parton’s', 'PERSON'), ('Christmas', 'DATE'), ('Square', 'FAC'), ('Christine Baranski', 'PERSON'), ('Christmas Eve', 'DATE'), ('Debbie Allen', 'PERSON'), ('Jennifer Lewis', 'PERSON'), ('Parton', 'PERSON'), ('Christmas', 'DATE'), ('Square', 'FAC'), ('Christmas 2020', 'DATE'), ('😻', 'GPE')]
ORG means: Companies, agencies, institutions, etc.
PERSON means: People, including fictional
DATE means: Absolute or relative dates or periods
from spacy import displacy
displacy.render(doc, style="ent")
An example from a project#
Goal: Extract and visualize inter-corporate relationships from disclosed annual 10-K reports of public companies.
text = (
"Heavy hitters, including Microsoft and Google, "
"are competing for customers in cloud services with the likes of IBM and Salesforce."
)
doc = nlp(text)
displacy.render(doc, style="ent")
print("Named entities:\n", [(ent.text, ent.label_) for ent in doc.ents])
Named entities:
[('Microsoft', 'ORG'), ('Google', 'ORG'), ('IBM', 'ORG'), ('Salesforce', 'PRODUCT')]
If you want emoji identification support, install spacymoji in the course environment.
pip install spacymoji
After installing spacymoji, if it's still complaining that the module is not found, you probably do not have pip installed in your conda environment. Go to your course conda environment, install pip, and then install the spacymoji package using the pip you just installed in that environment.
conda install pip
YOUR_MINICONDA_PATH/miniconda3/envs/cpsc330/bin/pip install spacymoji
from spacymoji import Emoji
nlp.add_pipe("emoji", first=True);
Does the text have any emojis? If yes, extract the description.
doc = nlp(sample_text)
doc._.emoji
[('😻', 138, 'smiling cat face with heart-eyes'),
('👍🏿', 139, 'thumbs up dark skin tone')]
Simple feature engineering for our problem.#
def get_relative_length(text, TWITTER_ALLOWED_CHARS=280.0):
"""
Returns the relative length of text.
Parameters:
------
text: (str)
the input text
Keyword arguments:
------
TWITTER_ALLOWED_CHARS: (float)
the denominator for finding relative length
Returns:
-------
relative length of text: (float)
"""
return len(text) / TWITTER_ALLOWED_CHARS
def get_length_in_words(text):
"""
Returns the length of the text in words.
Parameters:
------
text: (str)
the input text
Returns:
-------
length of tokenized text: (int)
"""
return len(nltk.word_tokenize(text))
def get_sentiment(text):
"""
Returns the compound score representing the sentiment: -1 (most extreme negative) and +1 (most extreme positive)
The compound score is a normalized score calculated by summing the valence scores of each word in the lexicon.
Parameters:
------
text: (str)
the input text
Returns:
-------
compound sentiment score of the text: (float)
"""
scores = sid.polarity_scores(text)
return scores["compound"]
train_df = train_df.assign(n_words=train_df["OriginalTweet"].apply(get_length_in_words))
train_df = train_df.assign(
vader_sentiment=train_df["OriginalTweet"].apply(get_sentiment)
)
train_df = train_df.assign(
rel_char_len=train_df["OriginalTweet"].apply(get_relative_length)
)
test_df = test_df.assign(n_words=test_df["OriginalTweet"].apply(get_length_in_words))
test_df = test_df.assign(vader_sentiment=test_df["OriginalTweet"].apply(get_sentiment))
test_df = test_df.assign(
rel_char_len=test_df["OriginalTweet"].apply(get_relative_length)
)
train_df.shape
(3038, 9)
X_train
 | OriginalTweet |
---|---|
1927 | While I don't like all of Amazon's choices, to... |
1068 | Me: shit buckets, its time to do the weekly s... |
803 | @SecPompeo @realDonaldTrump You mean the plan ... |
2846 | @lauvagrande People who are sick arent panic ... |
3768 | Coronavirus Panic: Toilet Paper Is the People... |
... | ... |
1122 | Photos of our local grocery store shelveswher... |
1346 | Just went to the the grocery store (Highland F... |
3454 | Real talk though. Am I the only one spending h... |
3437 | The supermarket business is booming! #COVID2019 |
3582 | Evening All Here s the story on the and the im... |
3038 rows × 1 columns
X_train = train_df.drop(columns=["Sentiment"])
numeric_features = ["vader_sentiment", "rel_char_len", "n_words"]
text_feature = "OriginalTweet"
drop_features = ["UserName", "ScreenName", "Location", "TweetAt"]
preprocessor = make_column_transformer(
(StandardScaler(), numeric_features),
(CountVectorizer(stop_words="english"), text_feature),
("drop", drop_features),
)
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
results["LR (more feats)"] = mean_std_cross_val_scores(
pipe, X_train, y_train, return_train_score=True, scoring=scoring_metrics
)
pd.DataFrame(results).T
 | fit_time | score_time | test_score | train_score |
---|---|---|---|---|
dummy | 0.001 (+/- 0.000) | 0.000 (+/- 0.000) | 0.280 (+/- 0.001) | 0.280 (+/- 0.000) |
logistic regression | 0.477 (+/- 0.024) | 0.007 (+/- 0.000) | 0.413 (+/- 0.011) | 0.999 (+/- 0.000) |
LR (more feats) | 0.608 (+/- 0.069) | 0.010 (+/- 0.001) | 0.689 (+/- 0.006) | 0.998 (+/- 0.001) |
pipe.fit(X_train, y_train)
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('standardscaler', StandardScaler(), ['vader_sentiment', 'rel_char_len', 'n_words']), ('countvectorizer', CountVectorizer(stop_words='english'), 'OriginalTweet'), ('drop', 'drop', ['UserName', 'ScreenName', 'Location', 'TweetAt'])])), ('logisticregression', LogisticRegression(max_iter=1000))])
cv_feats = (
pipe.named_steps["columntransformer"]
.named_transformers_["countvectorizer"]
.get_feature_names_out().tolist()
)
feat_names = numeric_features + cv_feats
coefs = pipe.named_steps["logisticregression"].coef_[0]
df = pd.DataFrame(
data={
"features": feat_names,
"coefficients": coefs,
}
)
df.sort_values("coefficients")
 | features | coefficients |
---|---|---|
0 | vader_sentiment | -6.156919 |
11329 | won | -1.386911 |
2549 | coronapocalypse | -0.818268 |
2212 | closed | -0.751250 |
8659 | retail | -0.730537 |
... | ... | ... |
3297 | don | 1.146414 |
9860 | stupid | 1.159503 |
4877 | hell | 1.295756 |
3127 | die | 1.364247 |
7502 | panic | 1.507011 |
11662 rows × 2 columns
We get a substantial improvement with our engineered features: the cross-validation score goes up from about 0.41 to about 0.69!
Common features used in text classification#
Bag of words#
So far for text data we have been using bag-of-words features.
They are good enough for many tasks. But …
This encoding throws out a lot of things we know about language.
It assumes that word order is not that important.
So if you want to improve the scores further on text classification tasks, you can carry out feature engineering.
Let’s look at some examples from research papers.
Example: Label “Personalized” Important E-mails:#
Features: bag of words, trigrams, regular expressions, and so on.
There might be some “globally” important messages:
“This is your mother, something terrible happened, give me a call ASAP.”
But your “important” message may be unimportant to others.
Similar for spam: “spam” for one user could be “not spam” for another.
Social features (e.g., percentage of sender emails that is read by the recipient)
Content features (e.g., recent terms the user has been using in emails)
Thread features (e.g., whether the user has started the thread)
…
The Learning Behind Gmail Priority Inbox#
Feature engineering examples: Automatically Identifying Good Conversations Online#
(Optional) Term weighting (TF-IDF)#
A measure of relatedness between words and documents
Intuition: Meaningful words may occur repeatedly in related documents, but functional words (e.g., make, the) may be distributed evenly over all documents
\[tfidf_{ij} = tf_{ij} \times \log\left(\frac{D}{df_i}\right)\]
where,
\(tf_{ij}\) → number of occurrences of the term \(w_i\) in document \(d_j\)
\(D\) → number of documents
\(df_i\) → number of documents in which \(w_i\) occurs
Check TfidfVectorizer from sklearn.
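For example, here is a minimal sketch on a made-up toy corpus (toy_corpus and tfidf_vec are names invented here; this is not part of the lecture code):
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "the cats sat on the mat",
    "the dogs sat on the log",
    "cats and dogs can be pets",
]
tfidf_vec = TfidfVectorizer()
Z = tfidf_vec.fit_transform(toy_corpus)
# Each row is a document, each column a vocabulary term, and each value a tf-idf weight.
pd.DataFrame(Z.toarray(), columns=tfidf_vec.get_feature_names_out(), index=toy_corpus)
Terms that occur in many documents (e.g., the) get a smaller idf and therefore contribute less to the representation, all else being equal.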
N-grams#
Incorporating more context
A contiguous sequence of n items (characters, tokens) in text.
MDS students are hard-working .
2-grams (bigrams): a contiguous sequence of two words
MDS students, students are, are hard-working, hard-working .
3-grams (trigrams): a contiguous sequence of three words
MDS students are, students are hard-working, are hard-working .
You can extract n-gram features using CountVectorizer by passing ngram_range.
from sklearn.feature_extraction.text import CountVectorizer
X = [
"URGENT!! As a valued network customer you have been selected to receive a $900 prize reward!",
"Lol you are always so convincing.",
"URGENT!! Call right away!!",
]
vec = CountVectorizer(ngram_range=(1, 3))
X_counts = vec.fit_transform(X)
bow_df = pd.DataFrame(X_counts.toarray(), columns=vec.get_feature_names_out(), index=X)
bow_df
 | 900 | 900 prize | 900 prize reward | always | always so | always so convincing | are | are always | are always so | as | ... | urgent call | urgent call right | valued | valued network | valued network customer | you | you are | you are always | you have | you have been |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
URGENT!! As a valued network customer you have been selected to receive a $900 prize reward! | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
Lol you are always so convincing. | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
URGENT!! Call right away!! | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 rows × 61 columns
ASIDE: Google n-gram viewer#
All Our N-gram are Belong to You
https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
“Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. That's why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.”
from IPython.display import IFrame
url = "https://books.google.com/ngrams/"
IFrame(url, width=1000, height=800)
Aside: Google n-gram viewer#
Count the occurrences of the bigram smart women in the corpus from 1800 to 2000
Aside: Google n-gram viewer#
Trends in the word challenge used as a noun vs. verb
Part-of-speech features#
Part-of-speech (POS) in English#
Part-of-speech: A kind of syntactic category that tells you some of the grammatical properties of a word.
Noun → water, sun, cat
Verb → run, eat, teach
The ____ was running.
Only a noun fits here.
Part-of-speech (POS) features#
POS features use POS information for the words in text.
CPSC330/PROPER_NOUN students/NOUN are/VERB hard-working/ADJECTIVE
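As a rough sketch (reusing the nlp spaCy pipeline loaded earlier; the helper name get_pos_counts is invented here), POS information can be turned into simple count features:
def get_pos_counts(text, tags=("NOUN", "VERB", "ADJ", "PROPN")):
    """
    Returns a dictionary mapping each coarse POS tag in tags
    to the number of tokens in text with that tag.
    """
    doc = nlp(text)
    return {tag: sum(token.pos_ == tag for token in doc) for tag in tags}

get_pos_counts("MDS students are hard-working and run many experiments.")
Counts (or proportions) like these can be added as numeric columns alongside the bag-of-words features, just like vader_sentiment and n_words above.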
An example from a project#
Data: a bunch of documents
Task: identify texts with permissions and identify who is giving permission to whom.
You may disclose Google confidential information when compelled to do so by law if you provide us reasonable prior notice, unless a court orders that we not receive notice.
A very simple solution
Look for pronouns and verbs.
Add POS tags as features in your model.
Maybe look up words similar to disclose.
Interim summary#
In the context of text data, if we want to go beyond bag-of-words and incorporate human knowledge in models, we carry out feature engineering.
Some common features include:
ngram features
part-of-speech features
named entity features
emoticons in text
These are usually extracted from pre-trained models using libraries such as spaCy.
Now a lot of this has moved to deep learning.
But many industries still rely on manual feature engineering.
Summary#
What did we learn today?
Feature engineering is finding a useful representation of the data that can help us effectively solve our problem.
Non-linear regression (change of basis)
Radial basis functions
Importance of feature engineering in text data and audio data.
Feature engineering#
The best features are application-dependent.
It’s hard to give general advice. But here are some guidelines.
Ask the domain experts.
Go through academic papers in the discipline.
They often have an idea of the right discretization/standardization/transformation.
If there is no domain expert, cross-validation will help.
If you have lots of data, use deep learning methods.
The algorithms we used are very standard for Kagglers ... We spent most of our efforts in feature engineering...
- Xavier Conort, on winning the Flight Quest challenge on Kaggle