Introducing Evaluation Metrics

import pandas as pd
from sklearn.model_selection import train_test_split

cc_df = pd.read_csv('data/creditcard.csv.zip', encoding='latin-1')
train_df, test_df = train_test_split(cc_df, test_size=0.3, random_state=111)


train_df.head()
Time V1 V2 V3 ... V27 V28 Amount Class
64454 51150.0 -3.538816 3.481893 -1.827130 ... -0.023636 -0.454966 1.00 0
37906 39163.0 -0.363913 0.853399 1.648195 ... -0.186814 -0.257103 18.49 0
79378 57994.0 1.193021 -0.136714 0.622612 ... -0.036764 0.015039 23.74 0
245686 152859.0 1.604032 -0.808208 -1.594982 ... 0.005387 -0.057296 156.52 0
60943 49575.0 -2.669614 -2.734385 0.662450 ... 0.388023 0.161782 57.50 0

5 rows × 31 columns


train_df.shape
(199364, 31)
train_df.describe(include="all", percentiles=[])
Time V1 V2 V3 ... V27 V28 Amount Class
count 199364.000000 199364.000000 199364.000000 199364.000000 ... 199364.000000 199364.000000 199364.000000 199364.000000
mean 94888.815669 0.000492 -0.000726 0.000927 ... -0.000366 0.000227 88.164679 0.001700
std 47491.435489 1.959870 1.645519 1.505335 ... 0.401541 0.333139 238.925768 0.041201
min 0.000000 -56.407510 -72.715728 -31.813586 ... -22.565679 -11.710896 0.000000 0.000000
50% 84772.500000 0.018854 0.065463 0.179080 ... 0.001239 0.011234 22.000000 0.000000
max 172792.000000 2.451888 22.057729 9.382558 ... 12.152401 33.847808 11898.090000 1.000000

6 rows × 31 columns

X_train_big, y_train_big = train_df.drop(columns=["Class"]), train_df["Class"]
X_test, y_test = test_df.drop(columns=["Class"]), test_df["Class"]


X_train, X_valid, y_train, y_valid = train_test_split(X_train_big, 
                                                      y_train_big, 
                                                      test_size=0.3, 
                                                      random_state=123)

Baseline

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

dummy = DummyClassifier(strategy="most_frequent")
pd.DataFrame(cross_validate(dummy, X_train, y_train, return_train_score=True)).mean()
fit_time       0.010974
score_time     0.000922
test_score     0.998302
train_score    0.998302
dtype: float64
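
The dummy's 0.9983 test score looks impressive, but it is nothing more than the majority-class proportion: predicting "not fraud" for every example is right 99.83% of the time. A minimal, self-contained sketch of why `most_frequent` scores this way, using a made-up label vector with the same flavour of imbalance (not the real fraud data):

```python
from collections import Counter

# Made-up label vector: 997 negatives, 3 positives
y = [0] * 997 + [1] * 3

# DummyClassifier(strategy="most_frequent") always predicts the modal class,
# so its accuracy is exactly the majority-class proportion.
majority_class, _ = Counter(y).most_common(1)[0]
preds = [majority_class] * len(y)
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(accuracy)  # 0.997
```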


train_df["Class"].value_counts(normalize=True)
Class
0    0.9983
1    0.0017
Name: proportion, dtype: float64

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(random_state=123),
)


pd.DataFrame(cross_validate(pipe, X_train, y_train, return_train_score=True)).mean()
fit_time       9.942893
score_time     0.005338
test_score     0.999119
train_score    1.000000
dtype: float64
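
Both the dummy and the decision tree sit above 0.998 accuracy, so the default score barely separates them. `cross_validate` also accepts a `scoring` argument with a list of sklearn's built-in scorer names, each of which gets its own column in the results. A self-contained sketch on synthetic imbalanced data (a stand-in for the fraud set, not the real thing):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the fraud set (~1% positives)
X, y = make_classification(n_samples=2000, weights=[0.99], random_state=123)

pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=123))

# Each scorer name becomes its own test_* column in the results
scores = cross_validate(pipe, X, y, scoring=["accuracy", "recall", "precision"])
print(pd.DataFrame(scores).mean())
```

With a rare positive class, recall and precision expose mistakes that accuracy hides.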

What is “positive” and “negative”?

train_df["Class"].value_counts(normalize=True)
Class
0    0.9983
1    0.0017
Name: proportion, dtype: float64

There are two kinds of binary classification problems:

  • Distinguishing between two classes
  • Spotting a class (fraudulent transaction, spam, disease)

In the second kind, the class we want to spot (here, fraud) is conventionally the "positive" class, even though it is the rare one.

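Which class counts as "positive" matters when computing metrics. A small sketch with made-up labels (the 1s playing the role of fraud), using sklearn's `pos_label` parameter to switch which class a metric is computed for:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 1 = the rare class we want to spot
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# The same predictions score very differently depending on pos_label
print(recall_score(y_true, y_pred, pos_label=1))  # 0.5  (1 of 2 positives caught)
print(recall_score(y_true, y_pred, pos_label=0))  # 0.75 (3 of 4 negatives kept)
```
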
Confusion Matrix

pipe.fit(X_train, y_train);


from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

ConfusionMatrixDisplay.from_estimator(
    pipe, X_valid, y_valid,
    display_labels=["Non fraud", "Fraud"],
    values_format="d",
    cmap="Blues",
);
plt.show()
plt.show()


                  predict negative     predict positive
negative example  True negative (TN)   False positive (FP)
positive example  False negative (FN)  True positive (TP)

from sklearn.metrics import confusion_matrix


predictions = pipe.predict(X_valid)
confusion_matrix(y_valid, predictions)
array([[59674,    34],
       [   26,    76]])
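
Reading the four counts off the array above (rows are true classes, columns are predictions), we can recompute accuracy by hand and see why it stays near 0.999 even though 26 frauds slip through. A sketch using the printed numbers:

```python
# Counts from the confusion matrix above: [[tn, fp], [fn, tp]]
tn, fp = 59674, 34
fn, tp = 26, 76

accuracy = (tn + tp) / (tn + fp + fn + tp)
print(f"{accuracy:.4f}")  # 0.9990

# Of the 102 true frauds, 26 were missed -- accuracy barely registers it.
```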

Let’s apply what we learned!