Lecture 2: Regression metrics#

UBC Master of Data Science program, 2021-22

Instructor: Varada Kolhatkar

Imports#

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import (
    ColumnTransformer,
    TransformedTargetRegressor,
    make_column_transformer,
)
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

%matplotlib inline
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

Learning outcomes#

From this lecture, students are expected to be able to:

  • Carry out feature transformations on a somewhat complicated dataset.

  • Visualize transformed features as a dataframe.

  • Use Ridge and RidgeCV.

  • Explain how the alpha hyperparameter of Ridge relates to the fundamental tradeoff.

  • Explain the effect of alpha on the magnitude of the learned coefficients.

  • Examine coefficients of transformed features.

  • Appropriately select a scoring metric given a regression problem.

  • Interpret and communicate the meanings of different scoring metrics on regression problems.

    • MSE, RMSE, \(R^2\), MAPE

  • Apply a log transform to the target values in a regression problem with TransformedTargetRegressor.

❓❓ Questions for you#

iClicker Exercise 2.1#

iClicker cloud join link: https://join.iclicker.com/C0P55

Select all of the following statements which are TRUE.

  • (A) Average precision score at threshold 0.5 is the same as F1-score.

  • (B) Using class_weight="balanced" is likely to decrease false negatives but increase false positives.

  • (C) In ROC curve, the threshold goes from lower left (1.0) to upper right (0.0).

  • (D) It’s possible that automated tools such as pandas-profiling classify a categorical feature as a numeric feature.

V’s answers: B, C, D

After carrying out preprocessing, why is it useful to get feature names for the transformed features?

More comments on tackling class imbalance#

  • In lecture 1 we talked about a few rather simple approaches to deal with class imbalance.

  • If you have a problem such as fraud detection, where you want to spot rare events, you can also think of it as an anomaly detection problem and use different kinds of algorithms, such as isolation forests (see the sketch after this list).

  • If you are interested in this area, it might be worth checking out the book Imbalanced Learning: Foundations, Algorithms, and Applications.
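For instance, here is a minimal sketch (toy data and an assumed, made-up contamination fraction) of how an isolation forest can be used for anomaly detection in scikit-learn:

import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data: mostly "normal" points plus a few extreme ones
rng = np.random.RandomState(123)
X_toy = np.vstack(
    [rng.normal(0, 1, size=(100, 2)), rng.normal(8, 1, size=(3, 2))]
)

# contamination=0.03 is an assumed fraction of anomalies, not a recommendation
iso = IsolationForest(contamination=0.03, random_state=123)
labels = iso.fit_predict(X_toy)  # +1 = inlier, -1 = flagged as anomaly
print("Number of flagged points:", (labels == -1).sum())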

Note

When you calculate precision, recall, or F1 score, by default only the positive label is evaluated, and the positive class is assumed to be labeled 1. This is configurable through the pos_label parameter.
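For example, here is a small sketch (with made-up labels) showing how pos_label changes which class is treated as positive:

from sklearn.metrics import precision_score, recall_score

# Made-up labels where the class of interest is the string "fraud"
y_true_toy = ["fraud", "non-fraud", "fraud", "non-fraud", "non-fraud"]
y_pred_toy = ["fraud", "fraud", "non-fraud", "non-fraud", "non-fraud"]

# Tell sklearn which label should be treated as the positive class
print(precision_score(y_true_toy, y_pred_toy, pos_label="fraud"))  # 0.5
print(recall_score(y_true_toy, y_pred_toy, pos_label="fraud"))  # 0.5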

Dataset [video]#

In this lecture, we’ll be using the Kaggle House Prices dataset. As usual, to run this notebook you’ll need to download the data. For this dataset, train and test have already been separated. We’ll be working with the train portion in this lecture.

df = pd.read_csv("data/housing-kaggle/train.csv")
train_df, test_df = train_test_split(df, test_size=0.10, random_state=123)
train_df.head()
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
302 303 20 RL 118.0 13704 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 1 2006 WD Normal 205000
767 768 50 RL 75.0 12508 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN Shed 1300 7 2008 WD Normal 160000
429 430 20 RL 130.0 11457 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 3 2009 WD Normal 175000
1139 1140 30 RL 98.0 8731 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 144000
558 559 60 RL 57.0 21872 Pave NaN IR2 HLS AllPub ... 0 NaN NaN NaN 0 8 2008 WD Normal 175000

5 rows × 81 columns

  • The supervised machine learning problem is predicting housing price given features associated with properties.

  • Here, the target is SalePrice, which is continuous. So it’s a regression problem (as opposed to classification).

train_df.shape
(1314, 81)

Let’s separate X and y#

X_train = train_df.drop(columns=["SalePrice"])
y_train = train_df["SalePrice"]

X_test = test_df.drop(columns=["SalePrice"])
y_test = test_df["SalePrice"]

EDA#

train_df.describe()
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
count 1314.000000 1314.000000 1089.000000 1314.000000 1314.000000 1314.000000 1314.000000 1314.000000 1307.000000 1314.000000 ... 1314.000000 1314.000000 1314.000000 1314.000000 1314.000000 1314.000000 1314.000000 1314.000000 1314.000000 1314.000000
mean 734.182648 56.472603 69.641873 10273.261035 6.076104 5.570015 1970.995434 1984.659056 102.514155 441.425419 ... 94.281583 45.765601 21.726788 3.624049 13.987062 3.065449 46.951294 6.302131 2007.840183 179802.147641
std 422.224662 42.036646 23.031794 8997.895541 1.392612 1.112848 30.198127 20.639754 178.301563 459.276687 ... 125.436492 65.757545 60.766423 30.320430 53.854129 42.341109 522.283421 2.698206 1.332824 79041.260572
min 1.000000 20.000000 21.000000 1300.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 34900.000000
25% 369.250000 20.000000 59.000000 7500.000000 5.000000 5.000000 1953.000000 1966.250000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 2007.000000 129600.000000
50% 735.500000 50.000000 69.000000 9391.000000 6.000000 5.000000 1972.000000 1993.000000 0.000000 376.000000 ... 0.000000 24.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000 162000.000000
75% 1099.750000 70.000000 80.000000 11509.000000 7.000000 6.000000 2000.000000 2004.000000 165.500000 704.750000 ... 168.000000 66.750000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000 212975.000000
max 1460.000000 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1378.000000 5644.000000 ... 857.000000 547.000000 552.000000 508.000000 480.000000 738.000000 15500.000000 12.000000 2010.000000 755000.000000

8 rows × 38 columns

train_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1314 entries, 302 to 1389
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1314 non-null   int64  
 1   MSSubClass     1314 non-null   int64  
 2   MSZoning       1314 non-null   object 
 3   LotFrontage    1089 non-null   float64
 4   LotArea        1314 non-null   int64  
 5   Street         1314 non-null   object 
 6   Alley          81 non-null     object 
 7   LotShape       1314 non-null   object 
 8   LandContour    1314 non-null   object 
 9   Utilities      1314 non-null   object 
 10  LotConfig      1314 non-null   object 
 11  LandSlope      1314 non-null   object 
 12  Neighborhood   1314 non-null   object 
 13  Condition1     1314 non-null   object 
 14  Condition2     1314 non-null   object 
 15  BldgType       1314 non-null   object 
 16  HouseStyle     1314 non-null   object 
 17  OverallQual    1314 non-null   int64  
 18  OverallCond    1314 non-null   int64  
 19  YearBuilt      1314 non-null   int64  
 20  YearRemodAdd   1314 non-null   int64  
 21  RoofStyle      1314 non-null   object 
 22  RoofMatl       1314 non-null   object 
 23  Exterior1st    1314 non-null   object 
 24  Exterior2nd    1314 non-null   object 
 25  MasVnrType     1307 non-null   object 
 26  MasVnrArea     1307 non-null   float64
 27  ExterQual      1314 non-null   object 
 28  ExterCond      1314 non-null   object 
 29  Foundation     1314 non-null   object 
 30  BsmtQual       1280 non-null   object 
 31  BsmtCond       1280 non-null   object 
 32  BsmtExposure   1279 non-null   object 
 33  BsmtFinType1   1280 non-null   object 
 34  BsmtFinSF1     1314 non-null   int64  
 35  BsmtFinType2   1280 non-null   object 
 36  BsmtFinSF2     1314 non-null   int64  
 37  BsmtUnfSF      1314 non-null   int64  
 38  TotalBsmtSF    1314 non-null   int64  
 39  Heating        1314 non-null   object 
 40  HeatingQC      1314 non-null   object 
 41  CentralAir     1314 non-null   object 
 42  Electrical     1313 non-null   object 
 43  1stFlrSF       1314 non-null   int64  
 44  2ndFlrSF       1314 non-null   int64  
 45  LowQualFinSF   1314 non-null   int64  
 46  GrLivArea      1314 non-null   int64  
 47  BsmtFullBath   1314 non-null   int64  
 48  BsmtHalfBath   1314 non-null   int64  
 49  FullBath       1314 non-null   int64  
 50  HalfBath       1314 non-null   int64  
 51  BedroomAbvGr   1314 non-null   int64  
 52  KitchenAbvGr   1314 non-null   int64  
 53  KitchenQual    1314 non-null   object 
 54  TotRmsAbvGrd   1314 non-null   int64  
 55  Functional     1314 non-null   object 
 56  Fireplaces     1314 non-null   int64  
 57  FireplaceQu    687 non-null    object 
 58  GarageType     1241 non-null   object 
 59  GarageYrBlt    1241 non-null   float64
 60  GarageFinish   1241 non-null   object 
 61  GarageCars     1314 non-null   int64  
 62  GarageArea     1314 non-null   int64  
 63  GarageQual     1241 non-null   object 
 64  GarageCond     1241 non-null   object 
 65  PavedDrive     1314 non-null   object 
 66  WoodDeckSF     1314 non-null   int64  
 67  OpenPorchSF    1314 non-null   int64  
 68  EnclosedPorch  1314 non-null   int64  
 69  3SsnPorch      1314 non-null   int64  
 70  ScreenPorch    1314 non-null   int64  
 71  PoolArea       1314 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          259 non-null    object 
 74  MiscFeature    50 non-null     object 
 75  MiscVal        1314 non-null   int64  
 76  MoSold         1314 non-null   int64  
 77  YrSold         1314 non-null   int64  
 78  SaleType       1314 non-null   object 
 79  SaleCondition  1314 non-null   object 
 80  SalePrice      1314 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 841.8+ KB

pandas-profiling#

We do not have pandas_profiling in our course environment. You will have to install it in the environment on your own if you want to run the code below.

conda install -c conda-forge pandas-profiling

# from pandas_profiling import ProfileReport

# profile = ProfileReport(train_df, title="Pandas Profiling Report")  # , minimal=True)
# profile.to_notebook_iframe()

Feature types#

  • Do not blindly trust all the info given to you by automated tools.

  • How does pandas profiling figure out the data type?

    • You can look at the Python data type and say floats are numeric, strings are categorical.

    • However, in doing so you would miss out on various subtleties such as some of the string features being ordinal rather than truly categorical.

    • Also, it will think free text is categorical.

  • In addition to tools such as above, it’s important to go through data description to understand the data.

  • The data description for our dataset is available here.

Feature types#

  • We have mixed feature types and a bunch of missing values.

  • Now, let’s identify feature types and transformations.

  • Let’s get the numeric-looking columns.

numeric_looking_columns = X_train.select_dtypes(include=np.number).columns.tolist()
print(numeric_looking_columns)
['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']

Not all numeric-looking columns are necessarily numeric.

train_df["MSSubClass"].unique()
array([ 20,  50,  30,  60, 160,  85,  90, 120, 180,  80,  70,  75, 190,
        45,  40])

MSSubClass: Identifies the type of dwelling involved in the sale.

    20	1-STORY 1946 & NEWER ALL STYLES
    30	1-STORY 1945 & OLDER
    40	1-STORY W/FINISHED ATTIC ALL AGES
    45	1-1/2 STORY - UNFINISHED ALL AGES
    50	1-1/2 STORY FINISHED ALL AGES
    60	2-STORY 1946 & NEWER
    70	2-STORY 1945 & OLDER
    75	2-1/2 STORY ALL AGES
    80	SPLIT OR MULTI-LEVEL
    85	SPLIT FOYER
    90	DUPLEX - ALL STYLES AND AGES
   120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
   150	1-1/2 STORY PUD - ALL AGES
   160	2-STORY PUD - 1946 & NEWER
   180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
   190	2 FAMILY CONVERSION - ALL STYLES AND AGES

Also, month sold is more of a categorical feature than a numeric feature.

train_df["MoSold"].unique()  # Month Sold
array([ 1,  7,  3,  5,  8, 10,  6,  9, 12,  2,  4, 11])
drop_features = ["Id"]
numeric_features = [
    "BedroomAbvGr",
    "KitchenAbvGr",
    "LotFrontage",
    "LotArea",
    "OverallQual",
    "OverallCond",
    "YearBuilt",
    "YearRemodAdd",
    "MasVnrArea",
    "BsmtFinSF1",
    "BsmtFinSF2",
    "BsmtUnfSF",
    "TotalBsmtSF",
    "1stFlrSF",
    "2ndFlrSF",
    "LowQualFinSF",
    "GrLivArea",
    "BsmtFullBath",
    "BsmtHalfBath",
    "FullBath",
    "HalfBath",
    "TotRmsAbvGrd",
    "Fireplaces",
    "GarageYrBlt",
    "GarageCars",
    "GarageArea",
    "WoodDeckSF",
    "OpenPorchSF",
    "EnclosedPorch",
    "3SsnPorch",
    "ScreenPorch",
    "PoolArea",
    "MiscVal",
    "YrSold",
]

Note

I’ve not looked at all the features carefully. It might be appropriate to apply some other encoding on some of the numeric features above.

set(numeric_looking_columns) - set(numeric_features) - set(drop_features)
{'MSSubClass', 'MoSold'}

We’ll treat the above numeric-looking features as categorical features.

  • There are a bunch of ordinal features in this dataset.

  • Ordinal features with the same scale

    • Poor (Po), Fair (Fa), Typical (TA), Good (Gd), Excellent (Ex)

    • These we’ll be calling ordinal_features_reg.

  • Ordinal features with different scales

    • These we’ll be calling ordinal_features_oth.

ordinal_features_reg = [
    "ExterQual",
    "ExterCond",
    "BsmtQual",
    "BsmtCond",
    "HeatingQC",
    "KitchenQual",
    "FireplaceQu",
    "GarageQual",
    "GarageCond",
    "PoolQC",
]
ordering = [
    "Po",
    "Fa",
    "TA",
    "Gd",
    "Ex",
]  # if N/A it will just impute something, per below
ordering_ordinal_reg = [ordering] * len(ordinal_features_reg)
ordering_ordinal_reg
[['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 ['Po', 'Fa', 'TA', 'Gd', 'Ex']]

We’ll pass the above as categories in our OrdinalEncoder.

  • There are a bunch more ordinal features using different scales.

    • These we’ll be calling ordinal_features_oth.

    • We are encoding them separately.

ordinal_features_oth = [
    "BsmtExposure",
    "BsmtFinType1",
    "BsmtFinType2",
    "Functional",
    "Fence",
]
ordering_ordinal_oth = [
    ["NA", "No", "Mn", "Av", "Gd"],
    ["NA", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    ["NA", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    ["Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ"],
    ["NA", "MnWw", "GdWo", "MnPrv", "GdPrv"],
]

The remaining features are categorical features.

categorical_features = list(
    set(X_train.columns)
    - set(numeric_features)
    - set(ordinal_features_reg)
    - set(ordinal_features_oth)
    - set(drop_features)
)
categorical_features
['SaleType',
 'Neighborhood',
 'Foundation',
 'Street',
 'Exterior2nd',
 'RoofStyle',
 'MSZoning',
 'PavedDrive',
 'Utilities',
 'LotConfig',
 'SaleCondition',
 'LandSlope',
 'LotShape',
 'GarageFinish',
 'Condition2',
 'MasVnrType',
 'Exterior1st',
 'GarageType',
 'LandContour',
 'BldgType',
 'Electrical',
 'MSSubClass',
 'CentralAir',
 'Condition1',
 'RoofMatl',
 'Alley',
 'MiscFeature',
 'HouseStyle',
 'Heating',
 'MoSold']
  • We are not doing it here, but we can engineer our own features too (see the sketch after this list).

  • Would price per square foot be a good feature to add in here?
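As an aside, here is a minimal sketch of what feature engineering could look like; total_SF is a hypothetical example feature, and we are not adding it to our pipeline:

# A hypothetical engineered feature: total (basement + above-ground) square footage.
# Just a sketch; we are not adding it to X_train in this lecture.
total_SF = train_df["TotalBsmtSF"] + train_df["1stFlrSF"] + train_df["2ndFlrSF"]
total_SF.head()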

Applying feature transformations#

  • Since we have mixed feature types, let’s use ColumnTransformer to apply different transformations on different feature types.

from sklearn.compose import ColumnTransformer, make_column_transformer

numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
ordinal_transformer_reg = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(categories=ordering_ordinal_reg),
)

ordinal_transformer_oth = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(categories=ordering_ordinal_oth),
)

categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore", sparse=False),
)

preprocessor = make_column_transformer(
    ("drop", drop_features),
    (numeric_transformer, numeric_features),
    (ordinal_transformer_reg, ordinal_features_reg),
    (ordinal_transformer_oth, ordinal_features_oth),
    (categorical_transformer, categorical_features),
)

Examining the preprocessed data#

preprocessor.fit(X_train)  # Calling fit to examine all the transformers.
preprocessor.named_transformers_
{'drop': 'drop',
 'pipeline-1': Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                 ('standardscaler', StandardScaler())]),
 'pipeline-2': Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')),
                 ('ordinalencoder',
                  OrdinalEncoder(categories=[['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                                             ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                                             ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                                             ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                                             ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                                             ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                                             ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                                             ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                                             ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                                             ['Po', 'Fa', 'TA', 'Gd', 'Ex']]))]),
 'pipeline-3': Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')),
                 ('ordinalencoder',
                  OrdinalEncoder(categories=[['NA', 'No', 'Mn', 'Av', 'Gd'],
                                             ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ',
                                              'ALQ', 'GLQ'],
                                             ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ',
                                              'ALQ', 'GLQ'],
                                             ['Sal', 'Sev', 'Maj2', 'Maj1',
                                              'Mod', 'Min2', 'Min1', 'Typ'],
                                             ['NA', 'MnWw', 'GdWo', 'MnPrv',
                                              'GdPrv']]))]),
 'pipeline-4': Pipeline(steps=[('simpleimputer',
                  SimpleImputer(fill_value='missing', strategy='constant')),
                 ('onehotencoder',
                  OneHotEncoder(handle_unknown='ignore', sparse=False))])}
ohe_columns = list(
    preprocessor.named_transformers_["pipeline-4"]
    .named_steps["onehotencoder"]
    .get_feature_names_out(categorical_features)
)
new_columns = (
    numeric_features + ordinal_features_reg + ordinal_features_oth + ohe_columns
)
X_train_enc = pd.DataFrame(
    preprocessor.transform(X_train), index=X_train.index, columns=new_columns
)
X_train_enc.head()
BedroomAbvGr KitchenAbvGr LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... MoSold_3 MoSold_4 MoSold_5 MoSold_6 MoSold_7 MoSold_8 MoSold_9 MoSold_10 MoSold_11 MoSold_12
302 0.154795 -0.222647 2.312501 0.381428 0.663680 -0.512408 0.993969 0.840492 0.269972 -0.961498 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
767 1.372763 -0.222647 0.260890 0.248457 -0.054669 1.285467 -1.026793 0.016525 -0.573129 0.476092 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
429 0.154795 -0.222647 2.885044 0.131607 -0.054669 -0.512408 0.563314 0.161931 -0.573129 1.227559 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1139 0.154795 -0.222647 1.358264 -0.171468 -0.773017 -0.512408 -1.689338 -1.679877 -0.573129 0.443419 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
558 0.154795 -0.222647 -0.597924 1.289541 0.663680 -0.512408 0.828332 0.598149 -0.573129 0.354114 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

5 rows × 263 columns

X_train.shape
(1314, 80)
X_train_enc.shape
(1314, 263)

We went from 80 features to 263 features!!

Other possible preprocessing?#

  • There is a lot of room for improvement …

  • We’re just using SimpleImputer.

    • In reality we’d want to go through this more carefully.

    • We may also want to drop some columns that are almost entirely missing (see the sketch after this list).

  • We could also check for outliers, and do other exploratory data analysis (EDA).

  • But for now this is good enough …
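For example, here is a minimal sketch (with an assumed, made-up 80% threshold) of how we could spot columns that are almost entirely missing:

# Fraction of missing values per column, sorted from most to least missing
missing_frac = X_train.isna().mean().sort_values(ascending=False)
print(missing_frac.head())

# Columns we *could* consider dropping under an assumed 80% threshold
print(missing_frac[missing_frac > 0.80].index.tolist())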





Model building#

DummyRegressor#

dummy = DummyRegressor()
pd.DataFrame(cross_validate(dummy, X_train, y_train, cv=10, return_train_score=True))
fit_time score_time test_score train_score
0 0.000892 0.000188 -0.003547 0.0
1 0.000425 0.000095 -0.001266 0.0
2 0.000552 0.000102 -0.011767 0.0
3 0.000515 0.000100 -0.006744 0.0
4 0.000511 0.000119 -0.076533 0.0
5 0.000424 0.000094 -0.003133 0.0
6 0.000395 0.000090 -0.000397 0.0
7 0.000407 0.000089 -0.003785 0.0
8 0.000390 0.000086 -0.001740 0.0
9 0.000387 0.000085 -0.000117 0.0

Let’s try a linear model: Ridge#

  • Recall that we are going to use Ridge() instead of LinearRegression() in this course.

  • Similar to linear regression, ridge regression is also a linear model for regression.

  • So the formula it uses to make predictions is the same one used for ordinary least squares (which you’ll be learning in DSCI 561).

  • But it has a hyperparameter alpha which controls the fundamental tradeoff.

lr = make_pipeline(preprocessor, Ridge())
lr.fit(X_train, y_train);
lr_preds = lr.predict(X_test)
lr_preds[:10]
array([224795.63596803,  65063.79363463, 133408.78346075, 250878.88144979,
       128875.99852629, 205597.41832137, 334578.36315754, 159241.04183629,
       148367.15710315, 129278.59247656])
lr_preds.max(), lr_preds.min()
(411432.16086516465, 38827.24302274236)
print("Smallest coefficient: ", lr.named_steps["ridge"].coef_.min())
print("Largest coefficient:", lr.named_steps["ridge"].coef_.max())
Smallest coefficient:  -191129.77431368965
Largest coefficient: 83646.71100760289

Let’s carry out cross-validation with Ridge.

lr_pipe = make_pipeline(preprocessor, Ridge())
pd.DataFrame(cross_validate(lr_pipe, X_train, y_train, cv=10, return_train_score=True))
fit_time score_time test_score train_score
0 0.024186 0.005183 0.861355 0.911906
1 0.019860 0.004168 0.812301 0.913861
2 0.018951 0.003956 0.775283 0.915963
3 0.019101 0.003930 0.874519 0.910849
4 0.018593 0.004010 0.851969 0.911622
5 0.018503 0.004136 0.826198 0.910176
6 0.018924 0.004096 0.825533 0.913781
7 0.018370 0.003908 0.872238 0.910071
8 0.018573 0.003908 0.196663 0.921448
9 0.018850 0.003883 0.890474 0.908221
  • Quite a bit of variation in the test scores.

  • Performing poorly in fold 8. Not sure why.

    • It probably contains some of the outliers in the data, which we more or less ignored.

Feature names of transformed data#

  • If you want to get the column names of newly created columns, you need to fit the transformer.

preprocessor
ColumnTransformer(transformers=[('drop', 'drop', ['Id']),
                                ('pipeline-1',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('standardscaler',
                                                  StandardScaler())]),
                                 ['BedroomAbvGr', 'KitchenAbvGr', 'LotFrontage',
                                  'LotArea', 'OverallQual', 'OverallCond',
                                  'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
                                  'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
                                  'TotalBsmtSF', '...
                                 ['SaleType', 'Neighborhood', 'Foundation',
                                  'Street', 'Exterior2nd', 'RoofStyle',
                                  'MSZoning', 'PavedDrive', 'Utilities',
                                  'LotConfig', 'SaleCondition', 'LandSlope',
                                  'LotShape', 'GarageFinish', 'Condition2',
                                  'MasVnrType', 'Exterior1st', 'GarageType',
                                  'LandContour', 'BldgType', 'Electrical',
                                  'MSSubClass', 'CentralAir', 'Condition1',
                                  'RoofMatl', 'Alley', 'MiscFeature',
                                  'HouseStyle', 'Heating', 'MoSold'])])
ohe_columns = list(
    preprocessor.named_transformers_["pipeline-4"]
    .named_steps["onehotencoder"]
    .get_feature_names_out(categorical_features)
)
ohe_columns
['SaleType_COD',
 'SaleType_CWD',
 'SaleType_Con',
 'SaleType_ConLD',
 'SaleType_ConLI',
 'SaleType_ConLw',
 'SaleType_New',
 'SaleType_Oth',
 'SaleType_WD',
 'Neighborhood_Blmngtn',
 'Neighborhood_Blueste',
 'Neighborhood_BrDale',
 'Neighborhood_BrkSide',
 'Neighborhood_ClearCr',
 'Neighborhood_CollgCr',
 'Neighborhood_Crawfor',
 'Neighborhood_Edwards',
 'Neighborhood_Gilbert',
 'Neighborhood_IDOTRR',
 'Neighborhood_MeadowV',
 'Neighborhood_Mitchel',
 'Neighborhood_NAmes',
 'Neighborhood_NPkVill',
 'Neighborhood_NWAmes',
 'Neighborhood_NoRidge',
 'Neighborhood_NridgHt',
 'Neighborhood_OldTown',
 'Neighborhood_SWISU',
 'Neighborhood_Sawyer',
 'Neighborhood_SawyerW',
 'Neighborhood_Somerst',
 'Neighborhood_StoneBr',
 'Neighborhood_Timber',
 'Neighborhood_Veenker',
 'Foundation_BrkTil',
 'Foundation_CBlock',
 'Foundation_PConc',
 'Foundation_Slab',
 'Foundation_Stone',
 'Foundation_Wood',
 'Street_Grvl',
 'Street_Pave',
 'Exterior2nd_AsbShng',
 'Exterior2nd_AsphShn',
 'Exterior2nd_Brk Cmn',
 'Exterior2nd_BrkFace',
 'Exterior2nd_CBlock',
 'Exterior2nd_CmentBd',
 'Exterior2nd_HdBoard',
 'Exterior2nd_ImStucc',
 'Exterior2nd_MetalSd',
 'Exterior2nd_Other',
 'Exterior2nd_Plywood',
 'Exterior2nd_Stone',
 'Exterior2nd_Stucco',
 'Exterior2nd_VinylSd',
 'Exterior2nd_Wd Sdng',
 'Exterior2nd_Wd Shng',
 'RoofStyle_Flat',
 'RoofStyle_Gable',
 'RoofStyle_Gambrel',
 'RoofStyle_Hip',
 'RoofStyle_Mansard',
 'RoofStyle_Shed',
 'MSZoning_C (all)',
 'MSZoning_FV',
 'MSZoning_RH',
 'MSZoning_RL',
 'MSZoning_RM',
 'PavedDrive_N',
 'PavedDrive_P',
 'PavedDrive_Y',
 'Utilities_AllPub',
 'Utilities_NoSeWa',
 'LotConfig_Corner',
 'LotConfig_CulDSac',
 'LotConfig_FR2',
 'LotConfig_FR3',
 'LotConfig_Inside',
 'SaleCondition_Abnorml',
 'SaleCondition_AdjLand',
 'SaleCondition_Alloca',
 'SaleCondition_Family',
 'SaleCondition_Normal',
 'SaleCondition_Partial',
 'LandSlope_Gtl',
 'LandSlope_Mod',
 'LandSlope_Sev',
 'LotShape_IR1',
 'LotShape_IR2',
 'LotShape_IR3',
 'LotShape_Reg',
 'GarageFinish_Fin',
 'GarageFinish_RFn',
 'GarageFinish_Unf',
 'GarageFinish_missing',
 'Condition2_Artery',
 'Condition2_Feedr',
 'Condition2_Norm',
 'Condition2_PosA',
 'Condition2_PosN',
 'Condition2_RRAe',
 'Condition2_RRAn',
 'Condition2_RRNn',
 'MasVnrType_BrkCmn',
 'MasVnrType_BrkFace',
 'MasVnrType_None',
 'MasVnrType_Stone',
 'MasVnrType_missing',
 'Exterior1st_AsbShng',
 'Exterior1st_AsphShn',
 'Exterior1st_BrkComm',
 'Exterior1st_BrkFace',
 'Exterior1st_CBlock',
 'Exterior1st_CemntBd',
 'Exterior1st_HdBoard',
 'Exterior1st_ImStucc',
 'Exterior1st_MetalSd',
 'Exterior1st_Plywood',
 'Exterior1st_Stone',
 'Exterior1st_Stucco',
 'Exterior1st_VinylSd',
 'Exterior1st_Wd Sdng',
 'Exterior1st_WdShing',
 'GarageType_2Types',
 'GarageType_Attchd',
 'GarageType_Basment',
 'GarageType_BuiltIn',
 'GarageType_CarPort',
 'GarageType_Detchd',
 'GarageType_missing',
 'LandContour_Bnk',
 'LandContour_HLS',
 'LandContour_Low',
 'LandContour_Lvl',
 'BldgType_1Fam',
 'BldgType_2fmCon',
 'BldgType_Duplex',
 'BldgType_Twnhs',
 'BldgType_TwnhsE',
 'Electrical_FuseA',
 'Electrical_FuseF',
 'Electrical_FuseP',
 'Electrical_Mix',
 'Electrical_SBrkr',
 'Electrical_missing',
 'MSSubClass_20',
 'MSSubClass_30',
 'MSSubClass_40',
 'MSSubClass_45',
 'MSSubClass_50',
 'MSSubClass_60',
 'MSSubClass_70',
 'MSSubClass_75',
 'MSSubClass_80',
 'MSSubClass_85',
 'MSSubClass_90',
 'MSSubClass_120',
 'MSSubClass_160',
 'MSSubClass_180',
 'MSSubClass_190',
 'CentralAir_N',
 'CentralAir_Y',
 'Condition1_Artery',
 'Condition1_Feedr',
 'Condition1_Norm',
 'Condition1_PosA',
 'Condition1_PosN',
 'Condition1_RRAe',
 'Condition1_RRAn',
 'Condition1_RRNe',
 'Condition1_RRNn',
 'RoofMatl_ClyTile',
 'RoofMatl_CompShg',
 'RoofMatl_Membran',
 'RoofMatl_Metal',
 'RoofMatl_Roll',
 'RoofMatl_Tar&Grv',
 'RoofMatl_WdShake',
 'RoofMatl_WdShngl',
 'Alley_Grvl',
 'Alley_Pave',
 'Alley_missing',
 'MiscFeature_Gar2',
 'MiscFeature_Othr',
 'MiscFeature_Shed',
 'MiscFeature_TenC',
 'MiscFeature_missing',
 'HouseStyle_1.5Fin',
 'HouseStyle_1.5Unf',
 'HouseStyle_1Story',
 'HouseStyle_2.5Fin',
 'HouseStyle_2.5Unf',
 'HouseStyle_2Story',
 'HouseStyle_SFoyer',
 'HouseStyle_SLvl',
 'Heating_Floor',
 'Heating_GasA',
 'Heating_GasW',
 'Heating_Grav',
 'Heating_OthW',
 'Heating_Wall',
 'MoSold_1',
 'MoSold_2',
 'MoSold_3',
 'MoSold_4',
 'MoSold_5',
 'MoSold_6',
 'MoSold_7',
 'MoSold_8',
 'MoSold_9',
 'MoSold_10',
 'MoSold_11',
 'MoSold_12']
new_columns = (
    numeric_features + ordinal_features_reg + ordinal_features_oth + ohe_columns
)

Tuning alpha hyperparameter of Ridge#

  • Recall that Ridge has a hyperparameter alpha that controls the fundamental tradeoff.

  • This is like C in LogisticRegression but, annoyingly, alpha is the inverse of C.

  • That is, large C is like small alpha and vice versa.

  • Smaller alpha: lower training error (overfitting)

param_grid = {"ridge__alpha": 10.0 ** np.arange(-5, 5, 1)}

pipe_ridge = make_pipeline(preprocessor, Ridge())

search = GridSearchCV(pipe_ridge, param_grid, return_train_score=True, n_jobs=-1)
search.fit(X_train, y_train)
train_scores = search.cv_results_["mean_train_score"]
cv_scores = search.cv_results_["mean_test_score"]
plt.semilogx(param_grid["ridge__alpha"], train_scores.tolist(), label="train")
plt.semilogx(param_grid["ridge__alpha"], cv_scores.tolist(), label="cv")
plt.legend()
plt.xlabel("alpha")
plt.ylabel("score");
(Figure: train and cross-validation scores as a function of alpha, plotted on a log scale.)
best_alpha = search.best_params_
best_alpha
{'ridge__alpha': 100.0}
search.best_score_
0.7950686603241023
  • It seems alpha=100 is the best choice here.

  • General intuition: larger alpha leads to smaller coefficients.

  • Smaller coefficients mean the predictions are less sensitive to changes in the data. Hence less chance of overfitting.

pipe_bigalpha = make_pipeline(preprocessor, Ridge(alpha=1000))
pipe_bigalpha.fit(X_train, y_train)
bigalpha_coeffs = pipe_bigalpha.named_steps["ridge"].coef_
pd.DataFrame(
    data=bigalpha_coeffs, index=new_columns, columns=["Coefficients"]
).sort_values(by="Coefficients", ascending=False)
Coefficients
OverallQual 9689.632266
GrLivArea 7816.816972
1stFlrSF 5930.575341
TotRmsAbvGrd 5208.547961
GarageCars 5055.756475
... ...
SaleType_WD -1214.377135
GarageFinish_Unf -1280.655867
Foundation_CBlock -1767.116100
RoofStyle_Gable -1984.434183
KitchenAbvGr -2623.381280

263 rows × 1 columns

  • Smaller alpha leads to bigger coefficients.

pipe_smallalpha = make_pipeline(preprocessor, Ridge(alpha=0.01))
pipe_smallalpha.fit(X_train, y_train)
smallalpha_coeffs = pipe_smallalpha.named_steps["ridge"].coef_
pd.DataFrame(
    data=smallalpha_coeffs, index=new_columns, columns=["Coefficients"]
).sort_values(by="Coefficients", ascending=False)
Coefficients
RoofMatl_WdShngl 128419.408757
RoofMatl_Membran 127334.605534
RoofMatl_Metal 106033.814922
Condition2_PosA 83475.834626
RoofMatl_CompShg 70577.031608
... ...
Exterior1st_ImStucc -34430.377551
Heating_OthW -34801.030771
Condition2_RRAe -63399.065969
Condition2_PosN -194897.224244
RoofMatl_ClyTile -582393.604620

263 rows × 1 columns

With the best alpha found by the grid search, the coefficients are somewhere in between.

pipe_bestalpha = make_pipeline(
    preprocessor, Ridge(alpha=search.best_params_["ridge__alpha"])
)
pipe_bestalpha.fit(X_train, y_train)
bestalpha_coeffs = pipe_bestalpha.named_steps["ridge"].coef_
pd.DataFrame(
    data=bestalpha_coeffs, index=new_columns, columns=["Coefficients"]
).sort_values(by="Coefficients", ascending=False)
Coefficients
OverallQual 14484.902165
GrLivArea 11704.053037
Neighborhood_NridgHt 9662.969631
Neighborhood_NoRidge 9497.598615
BsmtQual 8073.088562
... ...
RoofMatl_ClyTile -3992.399179
LandContour_Bnk -5001.996997
Neighborhood_Gilbert -5197.585536
Neighborhood_CollgCr -5467.463086
Neighborhood_Edwards -5796.508529

263 rows × 1 columns

To summarize:

  • Higher values of alpha means a more restricted model.

  • The values of coefficients are likely to be smaller for higher values of alpha compared to lower values of alpha.
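As a quick check of this intuition, we can compare the overall magnitude of the coefficient vectors computed above (reusing smallalpha_coeffs, bestalpha_coeffs, and bigalpha_coeffs):

# Larger alpha should shrink the coefficients, i.e., give a smaller norm.
for name, coeffs in [
    ("alpha=0.01", smallalpha_coeffs),
    ("alpha=100", bestalpha_coeffs),
    ("alpha=1000", bigalpha_coeffs),
]:
    print(f"{name:>10}: L2 norm of coefficients = {np.linalg.norm(coeffs):,.1f}")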

RidgeCV#

Because it’s so common to want to tune alpha with Ridge, sklearn provides a class called RidgeCV, which automatically tunes alpha based on cross-validation.

alphas = 10.0 ** np.arange(-6, 6, 1)
ridgecv_pipe = make_pipeline(preprocessor, RidgeCV(alphas=alphas, cv=10))
ridgecv_pipe.fit(X_train, y_train);
best_alpha = ridgecv_pipe.named_steps["ridgecv"].alpha_

best_alpha
100.0



Let’s examine the tuned model.

ridge_tuned = make_pipeline(preprocessor, Ridge(alpha=best_alpha))
ridge_tuned.fit(X_train, y_train)
ridge_preds = ridge_tuned.predict(X_test)
ridge_preds[:10]
array([228728.1963872 , 104718.39905565, 155778.96723311, 246316.71119031,
       127633.10676873, 243207.19441128, 304930.24461291, 145374.59435295,
       157059.38983893, 128487.51979632])
df = pd.DataFrame(
    data={"coefficients": ridge_tuned.named_steps["ridge"].coef_}, index=new_columns
)
df.sort_values("coefficients", ascending=False)
coefficients
OverallQual 14484.902165
GrLivArea 11704.053037
Neighborhood_NridgHt 9662.969631
Neighborhood_NoRidge 9497.598615
BsmtQual 8073.088562
... ...
RoofMatl_ClyTile -3992.399179
LandContour_Bnk -5001.996997
Neighborhood_Gilbert -5197.585536
Neighborhood_CollgCr -5467.463086
Neighborhood_Edwards -5796.508529

263 rows × 1 columns

So according to this model:

  • As the OverallQual feature gets bigger, the predicted housing price gets bigger.

  • Neighborhood_Edwards is associated with reducing the housing price.

    • We’ll talk more about interpretation of different kinds of features next week.

ridge_preds.max(), ridge_preds.min()
(390726.10647423257, 30791.092505420762)



❓❓ Questions for you#

iClicker Exercise 2.2#

iClicker cloud join link: https://join.iclicker.com/C0P55

Select all of the following statements which are TRUE.

  • (A) Price per square foot would be a good feature to add in our X.

  • (B) The alpha hyperparameter of Ridge has a similar interpretation to the C hyperparameter of LogisticRegression; higher alpha means a more complex model.

  • (C) In Ridge, smaller alpha means bigger coefficients whereas bigger alpha means smaller coefficients.

V’s answers: C

Can we use the classification metrics we looked at in Lecture 1 for this problem? Why or why not?





Regression scoring functions#

  • We aren’t doing classification anymore, so we can’t just check for equality:

ridge_tuned.predict(X_train) == y_train
302     False
767     False
429     False
1139    False
558     False
        ...  
1041    False
1122    False
1346    False
1406    False
1389    False
Name: SalePrice, Length: 1314, dtype: bool
y_train.values
array([205000, 160000, 175000, ..., 262500, 133000, 131000])
ridge_tuned.predict(X_train)
array([212894.62756285, 178502.78223444, 189937.18327372, ...,
       245233.6751565 , 129863.13373552, 135439.89186716])

We need a score that reflects how right/wrong each prediction is.

There are a number of popular scoring functions for regression. We are going to look at some common ones:

  • mean squared error (MSE)

  • \(R^2\)

  • root mean squared error (RMSE)

  • MAPE

See sklearn documentation for more details.

Mean squared error (MSE)#

  • A common metric is mean squared error:

preds = ridge_tuned.predict(X_train)
np.mean((y_train - preds) ** 2)
873230473.3636105

Perfect predictions would have MSE=0.

np.mean((y_train - y_train) ** 2)
0.0

This is also implemented in sklearn:

from sklearn.metrics import mean_squared_error

mean_squared_error(y_train, preds)
873230473.3636105
  • MSE looks huge and unreasonable. There is an error of ~$1 Billion!

  • Is this score good or bad?

  • Unlike classification, with regression our target has units.

  • The target is in dollars, so the mean squared error is in \(dollars^2\).

  • The score also depends on the scale of the targets.

  • If we were working in cents instead of dollars, our MSE would be \(10,000\) (\(=100^2\)) times higher!

np.mean((y_train * 100 - preds * 100) ** 2)
8732304733636.104

Root mean squared error or RMSE#

  • The MSE above is in \(dollars^2\).

  • A more relatable metric would be the root mean squared error, or RMSE

np.sqrt(mean_squared_error(y_train, ridge_tuned.predict(X_train)))
29550.473318774617
  • Error of $30,000 makes more sense.

  • Let’s dig deeper.

plt.scatter(y_train, ridge_tuned.predict(X_train), alpha=0.3)
grid = np.linspace(y_train.min(), y_train.max(), 1000)
plt.plot(grid, grid, "--k")
plt.xlabel("true price")
plt.ylabel("predicte price");
(Figure: predicted price vs. true price, with the line y = x shown dashed.)
  • Here we can see a few cases where our prediction is way off.

  • Is there something weird about those houses, perhaps? Outliers?

  • Under the line means we’re under-predicting, over the line means we’re over-predicting.

\(R^2\) (not in detail)#

A common score is the \(R^2\)

  • This is the score that sklearn uses by default when you call score() (see the quick check below).

  • You can read about it if interested.

  • \(R^2\) measures the proportion of variability in \(y\) that can be explained using \(X\).

  • You will learn about it in detail in DSCI 561.

  • Independent of the scale of \(y\). So the max is 1.

\[R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y_i})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}\]
  • The denominator measures the total variance in \(y\).

  • The numerator measures the amount of variability that is left unexplained after performing the regression.

Key points:

  • The maximum is 1 for perfect predictions

  • Negative values are very bad: worse than DummyRegressor.
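As a quick check (reusing the ridge_tuned pipeline fitted above), .score() on a regression pipeline returns \(R^2\) by default, so these two values should match:

# .score() for regressors is R^2 by default, so it should agree with r2_score
print(ridge_tuned.score(X_train, y_train))
print(r2_score(y_train, ridge_tuned.predict(X_train)))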

(Optional) Warning: MSE is “reversible” but \(R^2\) is not:

mean_squared_error(y_train, preds)
873230473.3636105
mean_squared_error(preds, y_train)
873230473.3636105
r2_score(y_train, preds)
0.8601212294857902
r2_score(preds, y_train)
0.8279622258827066
  • When you call fit it minimizes MSE / maximizes \(R^2\) (or something like that) by default.

  • Just like in classification, this isn’t always what you want!!

MAPE#

  • We got an RMSE of ~$30,000 before.

Question: Is an error of $30,000 acceptable?

np.sqrt(mean_squared_error(y_train, ridge_tuned.predict(X_train)))
29550.473318774617
  • For a house worth $600k, it seems reasonable! That’s 5% error.

  • For a house worth $60k, that is terrible. It’s 50% error.

We have both of these cases in our dataset.

plt.hist(y_train, bins=100);
(Figure: histogram of SalePrice in the training set.)

How about looking at percent error?

pred_train = ridge_tuned.predict(X_train)
percent_errors = (pred_train - y_train) / y_train * 100.0
percent_errors
302      3.851038
767     11.564239
429      8.535533
1139   -16.371069
558     17.177968
          ...    
1041    -0.496571
1122   -28.696351
1346    -6.577648
1406    -2.358546
1389     3.389230
Name: SalePrice, Length: 1314, dtype: float64

These are both positive (predict too high) and negative (predict too low).

We can look at the absolute percent error:

np.abs(percent_errors)
302      3.851038
767     11.564239
429      8.535533
1139    16.371069
558     17.177968
          ...    
1041     0.496571
1122    28.696351
1346     6.577648
1406     2.358546
1389     3.389230
Name: SalePrice, Length: 1314, dtype: float64

And, like MSE, we can take the average over examples. This is called mean absolute percent error (MAPE).

def my_mape(true, pred):
    return np.mean(np.abs((pred - true) / true))
my_mape(y_train, pred_train)
0.10093121294225234

Let’s use sklearn to calculate MAPE.

from sklearn.metrics import mean_absolute_percentage_error
mean_absolute_percentage_error(y_train, pred_train)
0.10093121294225234
  • Ok, this is quite interpretable.

  • On average, we have around 10% error.

Transforming the targets#

  • When you have prices or count data, the target values are often skewed.

  • Let’s look at our target column.

plt.hist(y_train, bins=100);
(Figure: histogram of SalePrice in the training set, same as above.)
  • A common trick in such cases is applying a log transform on the target column to make it more normal and less skewed.

  • That is, transform \(y\rightarrow \log(y)\).

  • Linear regression will usually work better on something that looks more normal.

plt.hist(np.log10(y_train), bins=100);
(Figure: histogram of log10(SalePrice) in the training set; the distribution looks more symmetric.)

We can incorporate this in our pipeline using sklearn.

from sklearn.compose import TransformedTargetRegressor
ttr = TransformedTargetRegressor(
    Ridge(alpha=best_alpha), func=np.log1p, inverse_func=np.expm1
) # transformer for log transforming the target
ttr_pipe = make_pipeline(preprocessor, ttr)
ttr_pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('drop', 'drop', ['Id']),
                                                 ('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['BedroomAbvGr',
                                                   'KitchenAbvGr',
                                                   'LotFrontage', 'LotArea',
                                                   'OverallQual', 'OverallCond',
                                                   'YearBuilt', 'YearRemodAdd',
                                                   'MasVnrArea', 'BsmtFinSF1',
                                                   'Bs...
                                                   'LotShape', 'GarageFinish',
                                                   'Condition2', 'MasVnrType',
                                                   'Exterior1st', 'GarageType',
                                                   'LandContour', 'BldgType',
                                                   'Electrical', 'MSSubClass',
                                                   'CentralAir', 'Condition1',
                                                   'RoofMatl', 'Alley',
                                                   'MiscFeature', 'HouseStyle',
                                                   'Heating', 'MoSold'])])),
                ('transformedtargetregressor',
                 TransformedTargetRegressor(func=<ufunc 'log1p'>,
                                            inverse_func=<ufunc 'expm1'>,
                                            regressor=Ridge(alpha=100.0)))])

Why can’t we incorporate preprocessing targets in our column transformer?

ttr_pipe.fit(X_train, y_train); # y_train automatically transformed
ttr_pipe.predict(X_train)  # predictions automatically un-transformed
array([221355.29528077, 170663.43286226, 182608.09768702, ...,
       248575.94877669, 132148.9047652 , 133262.17638244])
mean_absolute_percentage_error(y_test, ttr_pipe.predict(X_test))
0.07808600924240865

We reduced MAPE from ~10% to ~8% with this trick!

  • Does .fit() know we care about MAPE?

  • No, it doesn’t. Why are we minimizing MSE (or something similar) if we care about MAPE??

  • When minimizing MSE, the expensive houses will dominate because they have the biggest error.
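A tiny made-up illustration of this point: the same 10% error contributes very differently to MSE depending on the price, but identically to MAPE.

# Made-up example: 10% over-prediction on an expensive house and a cheap house
true_prices = np.array([600_000, 60_000])
predictions = true_prices * 1.10

print((predictions - true_prices) ** 2)  # squared errors: 3.6e9 vs 3.6e7
print(mean_squared_error(true_prices, predictions))  # dominated by the expensive house
print(mean_absolute_percentage_error(true_prices, predictions))  # approximately 0.10 for both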

Different scoring functions with cross_validate#

  • Let’s try using MSE instead of the default \(R^2\) score.

pd.DataFrame(
    cross_validate(
        ridge_tuned,
        X_train,
        y_train,
        return_train_score=True,
        scoring="neg_mean_squared_error",
    )
)
fit_time score_time test_score train_score
0 0.020544 0.005940 -7.060346e+08 -9.383069e+08
1 0.023720 0.005778 -1.239851e+09 -8.267971e+08
2 0.017837 0.004770 -1.125125e+09 -8.763019e+08
3 0.018021 0.004723 -9.819320e+08 -8.847908e+08
4 0.018081 0.004630 -2.268434e+09 -7.397199e+08
# make a scorer function that we can pass into cross-validation
mape_scorer = make_scorer(my_mape, greater_is_better=False)

pd.DataFrame(
    cross_validate(
        ridge_tuned, X_train, y_train, return_train_score=True, scoring=mape_scorer
    )
)
fit_time score_time test_score train_score
0 0.025276 0.005595 -0.096993 -0.104071
1 0.018330 0.005233 -0.108030 -0.099662
2 0.017748 0.004993 -0.118362 -0.101807
3 0.017986 0.004697 -0.107847 -0.102472
4 0.018004 0.005020 -0.121967 -0.098286

If you are finding greater_is_better=False argument confusing, here is the documentation:

greater_is_better : bool, default=True. Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.

Since our custom metric my_mape returns an error and not a score, I’m passing greater_is_better=False so that make_scorer sign-flips the result and bigger numbers can be interpreted as better performance.

# ?make_scorer
scoring = {
    "r2": "r2",
    "mape_scorer": mape_scorer, # just for demonstration for a custom scorer
    "sklearn MAPE": "neg_mean_absolute_percentage_error",
    "neg_root_mean_square_error": "neg_root_mean_squared_error",
    "neg_mean_squared_error": "neg_mean_squared_error",
}

pd.DataFrame(
    cross_validate(
        ridge_tuned, X_train, y_train, return_train_score=True, scoring=scoring
    )
).T
0 1 2 3 4
fit_time 2.498984e-02 1.821399e-02 1.756096e-02 1.761317e-02 1.713085e-02
score_time 5.598068e-03 5.260229e-03 5.133867e-03 4.958868e-03 5.253315e-03
test_r2 8.668969e-01 8.200460e-01 8.262644e-01 8.511854e-01 6.109505e-01
train_r2 8.551369e-01 8.636241e-01 8.579735e-01 8.561893e-01 8.834054e-01
test_mape_scorer -9.699277e-02 -1.080304e-01 -1.183620e-01 -1.078469e-01 -1.219672e-01
train_mape_scorer -1.040712e-01 -9.966190e-02 -1.018073e-01 -1.024720e-01 -9.828607e-02
test_sklearn MAPE -9.699277e-02 -1.080304e-01 -1.183620e-01 -1.078469e-01 -1.219672e-01
train_sklearn MAPE -1.040712e-01 -9.966190e-02 -1.018073e-01 -1.024720e-01 -9.828607e-02
test_neg_root_mean_square_error -2.657131e+04 -3.521152e+04 -3.354288e+04 -3.133579e+04 -4.762808e+04
train_neg_root_mean_square_error -3.063179e+04 -2.875408e+04 -2.960240e+04 -2.974543e+04 -2.719779e+04
test_neg_mean_squared_error -7.060346e+08 -1.239851e+09 -1.125125e+09 -9.819320e+08 -2.268434e+09
train_neg_mean_squared_error -9.383069e+08 -8.267971e+08 -8.763019e+08 -8.847908e+08 -7.397199e+08

Are we getting the same alpha with MAPE?

param_grid = {"ridge__alpha": 10.0 ** np.arange(-6, 6, 1)}

pipe_ridge = make_pipeline(preprocessor, Ridge())

search = GridSearchCV(
    pipe_ridge, param_grid, return_train_score=True, n_jobs=-1, scoring=mape_scorer
)
search.fit(X_train, y_train);
print("Best hyperparameter values: ", search.best_params_)
print("Best score: %0.3f" % (search.best_score_))
pd.DataFrame(search.cv_results_)[
    [
        "mean_train_score",
        "mean_test_score",
        "param_ridge__alpha",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().T
Best hyperparameter values:  {'ridge__alpha': 100.0}
Best score: -0.111
rank_test_score 1 2 3 4 5 6 7 8 9 10 11 12
mean_train_score -0.10126 -0.111549 -0.096454 -0.090825 -0.085733 -0.085533 -0.085569 -0.085573 -0.085574 -0.085574 -0.203402 -0.332121
mean_test_score -0.11064 -0.115412 -0.116506 -0.122385 -0.12576 -0.12737 -0.127556 -0.127572 -0.127574 -0.127574 -0.204672 -0.332702
param_ridge__alpha 100.0 1000.0 10.0 1.0 0.1 0.01 0.001 0.0001 0.00001 0.000001 10000.0 100000.0
mean_fit_time 0.030413 0.031483 0.030522 0.025756 0.031604 0.034754 0.031737 0.025349 0.024416 0.021097 0.028194 0.026882

Using multiple metrics in GridSearchCV or RandomizedSearchCV#

  • We could use multiple metrics with GridSearchCV or RandomizedSearchCV.

  • But if you do so, you need to set refit to the metric (as a string) that will be used to pick best_params_ and to refit best_estimator_ on the whole dataset.

search_multi = GridSearchCV(
    pipe_ridge,
    param_grid,
    return_train_score=True,
    n_jobs=-1,
    scoring=scoring,
    refit="mape_scorer",
)
search_multi.fit(X_train, y_train);
print("Best hyperparameter values: ", search_multi.best_params_)
print("Best score: %0.3f" % (search_multi.best_score_))
pd.DataFrame(search_multi.cv_results_).set_index("rank_test_mape_scorer").sort_index()
Best hyperparameter values:  {'ridge__alpha': 100.0}
Best score: -0.111
mean_fit_time std_fit_time mean_score_time std_score_time param_ridge__alpha params split0_test_r2 split1_test_r2 split2_test_r2 split3_test_r2 ... mean_test_neg_mean_squared_error std_test_neg_mean_squared_error rank_test_neg_mean_squared_error split0_train_neg_mean_squared_error split1_train_neg_mean_squared_error split2_train_neg_mean_squared_error split3_train_neg_mean_squared_error split4_train_neg_mean_squared_error mean_train_neg_mean_squared_error std_train_neg_mean_squared_error
rank_test_mape_scorer
1 0.026089 0.007016 0.008769 0.003642 100.0 {'ridge__alpha': 100.0} 0.866897 0.820046 0.826264 0.851185 ... -1.264275e+09 5.328496e+08 1 -9.383069e+08 -8.267971e+08 -8.763019e+08 -8.847908e+08 -7.397199e+08 -8.531833e+08 6.685761e+07
2 0.032000 0.007684 0.007678 0.002079 1000.0 {'ridge__alpha': 1000.0} 0.827194 0.786532 0.802856 0.812304 ... -1.362306e+09 3.261402e+08 3 -1.282106e+09 -1.156442e+09 -1.207920e+09 -1.225280e+09 -1.030069e+09 -1.180363e+09 8.519271e+07
3 0.028621 0.004971 0.010537 0.004730 10.0 {'ridge__alpha': 10.0} 0.858957 0.820552 0.832899 0.846307 ... -1.284019e+09 5.507493e+08 2 -7.031807e+08 -6.276881e+08 -6.910603e+08 -6.699413e+08 -5.916073e+08 -6.566956e+08 4.146121e+07
4 0.027580 0.004528 0.009929 0.003815 1.0 {'ridge__alpha': 1.0} 0.834450 0.810031 0.831653 0.843537 ... -1.388085e+09 6.365356e+08 4 -5.387270e+08 -4.897545e+08 -5.405067e+08 -5.288482e+08 -5.045273e+08 -5.204727e+08 2.001295e+07
5 0.025182 0.005201 0.006671 0.001477 0.1 {'ridge__alpha': 0.1} 0.819830 0.792735 0.805207 0.863338 ... -1.468319e+09 6.909866e+08 5 -4.573403e+08 -4.008696e+08 -4.345403e+08 -4.511181e+08 -4.845183e+08 -4.456773e+08 2.758988e+07
6 0.026737 0.005319 0.009814 0.005536 0.01 {'ridge__alpha': 0.01} 0.822784 0.782073 0.787900 0.867310 ... -1.504687e+09 7.065130e+08 6 -4.516497e+08 -3.925934e+08 -4.251121e+08 -4.441832e+08 -4.835932e+08 -4.394263e+08 3.006982e+07
7 0.028774 0.004413 0.010363 0.004246 0.001 {'ridge__alpha': 0.001} 0.824481 0.780386 0.785360 0.867572 ... -1.508977e+09 7.092363e+08 7 -4.515516e+08 -3.924490e+08 -4.249579e+08 -4.440666e+08 -4.835790e+08 -4.393208e+08 3.011365e+07
8 0.029260 0.004155 0.009032 0.003749 0.0001 {'ridge__alpha': 0.0001} 0.824671 0.780205 0.785095 0.867595 ... -1.509418e+09 7.095221e+08 8 -4.515504e+08 -3.924474e+08 -4.249563e+08 -4.440653e+08 -4.835789e+08 -4.393197e+08 3.011413e+07
9 0.027493 0.002594 0.010081 0.002627 0.00001 {'ridge__alpha': 1e-05} 0.824691 0.780187 0.785068 0.867597 ... -1.509463e+09 7.095507e+08 9 -4.515504e+08 -3.924474e+08 -4.249562e+08 -4.440653e+08 -4.835789e+08 -4.393197e+08 3.011413e+07
10 0.032921 0.006119 0.006292 0.000163 0.000001 {'ridge__alpha': 1e-06} 0.824693 0.780185 0.785065 0.867597 ... -1.509467e+09 7.095536e+08 10 -4.515504e+08 -3.924474e+08 -4.249562e+08 -4.440653e+08 -4.835789e+08 -4.393197e+08 3.011413e+07
11 0.032045 0.006053 0.008120 0.004116 10000.0 {'ridge__alpha': 10000.0} 0.593402 0.551970 0.568663 0.557891 ... -2.686107e+09 3.315718e+08 11 -2.761798e+09 -2.571596e+09 -2.662233e+09 -2.624434e+09 -2.595316e+09 -2.643075e+09 6.663313e+07
12 0.026692 0.004300 0.007573 0.003461 100000.0 {'ridge__alpha': 100000.0} 0.135414 0.113850 0.118889 0.123419 ... -5.442218e+09 5.531025e+08 12 -5.629636e+09 -5.271544e+09 -5.379227e+09 -5.342554e+09 -5.497948e+09 -5.424182e+09 1.262024e+08

12 rows × 80 columns

What’s the test score?

search_multi.score(X_test, y_test)
-0.09496387589495985
my_mape(y_test, ridge_tuned.predict(X_test))
0.09496387589495985

Using regression metrics with scikit-learn#

  • In sklearn, you will notice that there are negated versions of the metrics above (e.g., neg_mean_squared_error, neg_root_mean_squared_error).

  • The reason for this is that scorers always return a value to maximize: the higher, the better (see the short example below).
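For example, here is a small sketch using get_scorer and the ridge_tuned pipeline from above; the built-in "neg_mean_squared_error" scorer is just the sign-flipped MSE:

from sklearn.metrics import get_scorer

# Scorers follow the "higher is better" convention,
# so this scorer returns minus the mean squared error.
neg_mse_scorer = get_scorer("neg_mean_squared_error")
print(neg_mse_scorer(ridge_tuned, X_train, y_train))
print(-mean_squared_error(y_train, ridge_tuned.predict(X_train)))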

❓❓ Questions for you#

iClicker Exercise 2.3#

iClicker cloud join link: https://join.iclicker.com/C0P55

Select all of the following statements which are TRUE.

  • (A) We can still use precision and recall for regression problems, but now we have other metrics we can use as well.

  • (B) In sklearn for regression problems, using r2_score() and .score() (with default values) will produce the same results.

  • (C) RMSE is always going to be non-negative.

  • (D) MSE does not directly provide the information about whether the model is underpredicting or overpredicting.

  • (E) We can pass multiple scoring metrics to GridSearchCV or RandomizedSearchCV for regression as well as classification problems.

V’s answers: B, C, D, E





What did we learn today?#

  • The house prices dataset target is the sale price, which is numeric -> regression rather than classification

  • There are corresponding versions of all the tools we used:

    • DummyClassifier -> DummyRegressor

    • LogisticRegression -> Ridge

  • The Ridge hyperparameter alpha is like the LogisticRegression hyperparameter C, but with the opposite meaning

  • We’ll avoid LinearRegression in this course.

  • Scoring metrics

  • \(R^2\) is the default .score(), it is unitless, 0 is bad, 1 is best

  • MSE (mean squared error) is in units of target squared, hard to interpret; 0 is best

  • RMSE (root mean squared error) is in the same units as the target; 0 is best

  • MAPE (mean absolute percent error) is unitless; 0 is best, 1 is bad