Introduction to Machine Learning – Regression Measurements

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

housing_df = pd.read_csv("data/housing.csv")
train_df, test_df = train_test_split(housing_df, test_size=0.1, random_state=123)

X_train = train_df.drop(columns=["median_house_value"])
y_train = train_df["median_house_value"]
X_test = test_df.drop(columns=["median_house_value"])
y_test = test_df["median_house_value"]

numeric_features = [
    "longitude",
    "latitude",
    "housing_median_age",
    "households",
    "median_income",
    "rooms_per_household",
    "bedrooms_per_household",
    "population_per_household",
]

categorical_features = ["ocean_proximity"]

X_train.head(3)

	longitude	latitude	housing_median_age	households	...	ocean_proximity	rooms_per_household	bedrooms_per_household	population_per_household
6051	-117.75	34.04	22.0	602.0	...	INLAND	4.897010	1.056478	4.318937
20113	-119.57	37.94	17.0	20.0	...	INLAND	17.300000	6.500000	2.550000
14289	-117.13	32.74	46.0	708.0	...	NEAR OCEAN	4.738701	1.084746	2.057910

3 rows × 9 columns

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsRegressor

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), 
           ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
           ("onehot", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
    remainder="passthrough",
)

pipe = make_pipeline(preprocessor, KNeighborsRegressor())
pipe.fit(X_train, y_train);

predicted_y = pipe.predict(X_train)

predicted_y == y_train

6051     False
20113    False
14289    False
         ...  
17730    False
15725    False
19966    False
Name: median_house_value, Length: 18576, dtype: bool

y_train.values

array([113600., 137500., 170100., ..., 286200., 412500.,  59300.], shape=(18576,))

predicted_y

array([111740., 117380., 187700., ..., 271420., 265180.,  60860.], shape=(18576,))

Regression measurements

The scores we are going to discuss are:

mean squared error (MSE)
R²
root mean squared error (RMSE)
mean absolute percent error (MAPE)

If you want to see these in more detail, you can refer to the sklearn documentation.

Mean squared error (MSE)

predicted_y

array([111740., 117380., 187700., ..., 271420., 265180.,  60860.], shape=(18576,))

np.mean((y_train - predicted_y)**2)

np.float64(2570054492.048064)

np.mean((y_train - y_train)**2)

np.float64(0.0)

from sklearn.metrics import mean_squared_error

mean_squared_error(y_train, predicted_y)

2570054492.048064

R² (quick notes)

Key points:

The maximum possible value is 1, which means the model makes perfect predictions.
Negative values are very bad: the model is “worse than baseline models such as DummyRegressor”.
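To make these key points concrete, here is a minimal sketch of the usual R² definition, 1 - SS_res / SS_tot, computed by hand with NumPy. The helper name r2_by_hand and the toy arrays are assumptions made for illustration; they are not part of the housing example. Always predicting the mean of the targets (which is what DummyRegressor does with its default strategy) gives a score of 0, and predictions that are worse than that give a negative score.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of "always predict the mean"
    return 1.0 - ss_res / ss_tot

# Toy targets (made up for this sketch)
y_true = np.array([100.0, 200.0, 300.0, 400.0])

perfect = r2_by_hand(y_true, y_true)                      # perfect predictions
baseline = r2_by_hand(y_true, np.full(4, y_true.mean()))  # mean-only baseline
worse = r2_by_hand(y_true, y_true[::-1])                  # predictions in reversed order

print(perfect)   # 1.0
print(baseline)  # 0.0
print(worse)     # -3.0
```

The reversed predictions have four times the squared error of the mean baseline, which is why the score lands at -3.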

from sklearn.metrics import r2_score

print(mean_squared_error(y_train, predicted_y))
print(mean_squared_error(predicted_y, y_train))

2570054492.048064
2570054492.048064

print(r2_score(y_train, predicted_y))
print(r2_score(predicted_y, y_train))

0.8059396097446094
0.742915970464153
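The two R² values above differ because r2_score is not symmetric: it divides the squared error by the variance of its first argument, which it treats as the true targets. MSE has no such normalization, so swapping the arguments leaves it unchanged. A small sketch on made-up arrays (the values are assumptions chosen so the arithmetic is easy to check):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy data, invented for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 3.0, 3.5])

# MSE only depends on the differences, so argument order does not matter
mse_a = mean_squared_error(y_true, y_pred)
mse_b = mean_squared_error(y_pred, y_true)

# R^2 divides by the spread of the FIRST argument, so order matters
r2_a = r2_score(y_true, y_pred)  # normalized by the spread of y_true
r2_b = r2_score(y_pred, y_true)  # normalized by the spread of y_pred

print(mse_a, mse_b)
print(r2_a, r2_b)
```

In practice this means the true targets should always be passed first, as in r2_score(y_train, predicted_y).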

Root mean squared error (RMSE)

mean_squared_error(y_train, predicted_y)

2570054492.048064

np.sqrt(mean_squared_error(y_train, predicted_y))

np.float64(50695.704867849156)
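Taking the square root puts the error back in the units of the target (dollars here) rather than squared dollars, which makes it easier to read. The sketch below checks this on three made-up house prices; the arrays are assumptions for illustration only. Recent scikit-learn releases also appear to provide a root_mean_squared_error function in sklearn.metrics, but the NumPy version below does not depend on it.

```python
import numpy as np

# Toy prices in dollars, invented for this sketch
y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred = np.array([110_000.0, 190_000.0, 330_000.0])

errors = y_pred - y_true      # errors in dollars: 10000, -10000, 30000
mse = np.mean(errors ** 2)    # units: dollars squared
rmse = np.sqrt(mse)           # units: dollars

print(mse)
print(rmse)
```

The RMSE of roughly 19,000 dollars sits on the same scale as the individual errors, while the MSE of roughly 3.7e8 does not.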

Mean absolute percent error (MAPE)

percent_errors = (predicted_y - y_train)/y_train * 100.
percent_errors.head()

6051     -1.637324
20113   -14.632727
14289    10.346855
13665     6.713070
14471   -10.965854
Name: median_house_value, dtype: float64

np.abs(percent_errors).head()

6051      1.637324
20113    14.632727
14289    10.346855
13665     6.713070
14471    10.965854
Name: median_house_value, dtype: float64

100.*np.mean(np.abs((predicted_y - y_train)/y_train))

np.float64(18.19299750298522)
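scikit-learn also ships this metric as mean_absolute_percentage_error. One thing to watch for: it returns a fraction rather than a percentage, so it needs to be multiplied by 100 to match the manual calculation above. A minimal sketch on made-up values (the arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Toy data, invented for this sketch
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# Manual MAPE, as a percentage (same formula as above)
mape_manual = 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))

# sklearn returns a fraction (e.g. 0.0667), not a percentage
mape_sklearn = mean_absolute_percentage_error(y_true, y_pred)

print(mape_manual)
print(100.0 * mape_sklearn)
```

Both lines print the same value, about 6.67 percent: the three absolute percent errors are 10, 10 and 0.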

Let’s apply what we learned!