Regression Measurements

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

housing_df = pd.read_csv("data/housing.csv")
train_df, test_df = train_test_split(housing_df, test_size=0.1, random_state=123)
X_train = train_df.drop(columns=["median_house_value"])
y_train = train_df["median_house_value"]
X_test = test_df.drop(columns=["median_house_value"])
y_test = test_df["median_house_value"]

numeric_features = [
    "longitude",
    "latitude",
    "housing_median_age",
    "households",
    "median_income",
    "rooms_per_household",
    "bedrooms_per_household",
    "population_per_household",
]
                     
categorical_features = ["ocean_proximity"]

X_train.head(3)
       longitude  latitude  housing_median_age  households  ...  ocean_proximity  rooms_per_household  bedrooms_per_household  population_per_household
6051     -117.75     34.04                22.0       602.0  ...           INLAND             4.897010                1.056478                  4.318937
20113    -119.57     37.94                17.0        20.0  ...           INLAND            17.300000                6.500000                  2.550000
14289    -117.13     32.74                46.0       708.0  ...       NEAR OCEAN             4.738701                1.084746                  2.057910

3 rows × 9 columns

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsRegressor
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), 
           ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
           ("onehot", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
    remainder="passthrough",
)

pipe = make_pipeline(preprocessor, KNeighborsRegressor())
pipe.fit(X_train, y_train);
# Predict on the training data itself, so every score below is a training score
predicted_y = pipe.predict(X_train)


# Exact equality is not a useful check for a continuous target
predicted_y == y_train
6051     False
20113    False
14289    False
         ...  
17730    False
15725    False
19966    False
Name: median_house_value, Length: 18576, dtype: bool


y_train.values
array([113600., 137500., 170100., ..., 286200., 412500.,  59300.], shape=(18576,))


predicted_y
array([111740., 117380., 187700., ..., 271420., 265180.,  60860.], shape=(18576,))

Regression measurements

The scores we are going to discuss are:

  • mean squared error (MSE)
  • R² (coefficient of determination)
  • root mean squared error (RMSE)
  • mean absolute percentage error (MAPE)

For more detail on each of these, see the regression metrics section of the scikit-learn documentation.

Mean squared error (MSE)

predicted_y
array([111740., 117380., 187700., ..., 271420., 265180.,  60860.], shape=(18576,))


np.mean((y_train - predicted_y)**2)
np.float64(2570054492.048064)


# Perfect predictions give an MSE of 0
np.mean((y_train - y_train)**2)
np.float64(0.0)

from sklearn.metrics import mean_squared_error


mean_squared_error(y_train, predicted_y)
2570054492.048064
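Both computations above follow the same formula, MSE = (1/n) Σ (y_i − ŷ_i)². The following is a minimal sketch on a toy example (the four values are made up for illustration and are not from the housing data). It checks that the manual computation and `mean_squared_error` agree, and shows how one large error dominates the score because errors are squared.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values, made up for illustration (not from the housing data).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

# MSE by hand: the average of the squared differences.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

# Make a single prediction much worse to see how squaring amplifies it.
y_pred_outlier = y_pred.copy()
y_pred_outlier[3] = 200.0  # error of 200 instead of 40
mse_outlier = mean_squared_error(y_true, y_pred_outlier)

print(mse_manual, mse_sklearn, mse_outlier)
```

Because the errors are squared, the result is in squared units of the target (here, dollars squared), which is why the raw number for the housing data is so large.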

R² (quick notes)

Key points:

  • The maximum possible value is 1, which means the model's predictions are perfect.
  • Negative values are very bad: the model is worse than a baseline such as DummyRegressor.

from sklearn.metrics import r2_score
# MSE is symmetric: swapping the arguments gives the same score
print(mean_squared_error(y_train, predicted_y))
print(mean_squared_error(predicted_y, y_train))
2570054492.048064
2570054492.048064


# R² is not symmetric: the true targets must come first
print(r2_score(y_train, predicted_y))
print(r2_score(predicted_y, y_train))
0.8059396097446094
0.742915970464153
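The two `r2_score` results differ because R² divides the squared error by the spread of its first argument around its own mean: R² = 1 − Σ (y_i − ŷ_i)² / Σ (y_i − ȳ)². Here is a rough sketch of that formula on toy values (made up for illustration). The helper name `r2_manual` and the placeholder feature matrix `X_dummy` are assumptions of this example. It also shows that a mean-predicting `DummyRegressor` scores 0 on the data it was fit on.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Toy values, made up for illustration (not from the housing data).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

def r2_manual(y_true, y_pred):
    """R² by hand; `r2_manual` is a hypothetical helper name."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1 - ss_res / ss_tot

r2_forward = r2_manual(y_true, y_pred)  # true targets first (correct order)
r2_swapped = r2_manual(y_pred, y_true)  # swapped: normalized by the wrong variance

# A baseline that always predicts the mean of the targets it was fit on.
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by DummyRegressor
baseline = DummyRegressor(strategy="mean").fit(X_dummy, y_true)
r2_baseline = r2_score(y_true, baseline.predict(X_dummy))

print(r2_forward, r2_swapped, r2_baseline)
```

A score of 0 therefore means "no better than always predicting the mean", and anything below 0 means worse than that.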

Root mean squared error (RMSE)

mean_squared_error(y_train, predicted_y)
2570054492.048064


np.sqrt(mean_squared_error(y_train, predicted_y))
np.float64(50695.704867849156)
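Taking the square root brings the score back to the units of the target, so the value above can be read as an error in dollars. The sketch below uses toy values (made up for illustration) and a hypothetical helper named `rmse`. It shows that rescaling the target rescales RMSE by the same factor, while MSE changes by the factor squared.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error; `rmse` is a hypothetical helper name."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy values, made up for illustration (not from the housing data).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

score = rmse(y_true, y_pred)

# Express the same targets in thousands: RMSE shrinks by 1000,
# whereas MSE would shrink by 1000 squared.
score_thousands = rmse(y_true / 1000, y_pred / 1000)

print(score, score_thousands)
```

Recent scikit-learn releases also ship a `root_mean_squared_error` function in `sklearn.metrics`; wrapping `mean_squared_error` in `np.sqrt`, as above, works regardless of version.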

Mean absolute percentage error (MAPE)

percent_errors = (predicted_y - y_train)/y_train * 100.
percent_errors.head()
6051     -1.637324
20113   -14.632727
14289    10.346855
13665     6.713070
14471   -10.965854
Name: median_house_value, dtype: float64


np.abs(percent_errors).head()
6051      1.637324
20113    14.632727
14289    10.346855
13665     6.713070
14471    10.965854
Name: median_house_value, dtype: float64


100.*np.mean(np.abs((predicted_y - y_train)/y_train))
np.float64(18.19299750298522)
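scikit-learn offers the same calculation as `mean_absolute_percentage_error`. One detail is easy to miss: it returns a fraction rather than a percentage, so it has to be multiplied by 100 to match the manual result. A small sketch on toy values (made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Toy values, made up for illustration (not from the housing data).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

# By hand: average absolute error relative to the true value, as a percentage.
mape_manual = 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))

# sklearn returns a fraction (0.0875 here), so scale it to a percentage.
mape_fraction = mean_absolute_percentage_error(y_true, y_pred)
mape_sklearn = 100.0 * mape_fraction

print(mape_manual, mape_fraction, mape_sklearn)
```

Since each error is divided by the true value, MAPE is scale-free, but it becomes unstable when true values are at or near zero.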

Let’s apply what we learned!