Precision, Recall and F1 Score

Accuracy is only part of the story…

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe_tree = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(random_state=123)
)


from sklearn.model_selection import cross_validate
pd.DataFrame(cross_validate(pipe_tree, X_train, y_train, return_train_score=True)).mean()
fit_time       9.988710
score_time     0.005139
test_score     0.999119
train_score    1.000000
dtype: float64


y_train.value_counts(normalize=True)
Class
0    0.998302
1    0.001698
Name: proportion, dtype: float64
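
Only about 0.17% of the transactions are fraudulent. As a sanity check (not part of the original notes), a DummyClassifier that always predicts the majority class reaches almost the same accuracy as the decision-tree pipeline above; a minimal sketch, assuming the same X_train and y_train:

from sklearn.dummy import DummyClassifier

# Baseline that ignores the features and always predicts the majority class (non-fraud)
dummy = DummyClassifier(strategy="most_frequent")
pd.DataFrame(cross_validate(dummy, X_train, y_train, return_train_score=True)).mean()

Because roughly 99.8% of the labels are 0, this baseline scores about 0.998, so the tree's 0.999 accuracy says very little about how well fraud is actually detected.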
from sklearn.metrics import confusion_matrix

pipe_tree.fit(X_train, y_train);
predictions = pipe_tree.predict(X_valid)
confusion_matrix(y_valid, predictions)
array([[59674,    34],
       [   26,    76]])


TN, FP, FN, TP = confusion_matrix(y_valid, predictions).ravel()
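
For a binary problem with classes ordered [0, 1], confusion_matrix puts the true labels on the rows and the predicted labels on the columns, so .ravel() unpacks the counts in the order TN, FP, FN, TP.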

Recall

Among all positive examples, how many did you identify?

Recall = TP / (TP + FN)


confusion_matrix(y_valid, predictions)
array([[59674,    34],
       [   26,    76]])


TN, FP, FN, TP = confusion_matrix(y_valid, predictions).ravel()


recall = TP / (TP + FN)
recall.round(4)
np.float64(0.7451)
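
Reading the counts off the confusion matrix: recall = 76 / (76 + 26) ≈ 0.745, so about a quarter of the fraudulent transactions are still being missed.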

Precision

Among the positive examples you identified, how many were actually positive?

Precision = TP / (TP + FP)


confusion_matrix(y_valid, predictions)
array([[59674,    34],
       [   26,    76]])


TN, FP, FN, TP = confusion_matrix(y_valid, predictions).ravel()


precision = TP / (TP + FP)
precision.round(4)
np.float64(0.6909)
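
Here precision = 76 / (76 + 34) ≈ 0.691, so roughly 31% of the transactions flagged as fraud are false alarms.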

F1 score

The f1-score combines precision and recall into a single number: it is the harmonic mean of the two.

f1-score = (2 × precision × recall) / (precision + recall)


precision
np.float64(0.6909090909090909)


recall
np.float64(0.7450980392156863)


f1_score = (2 * precision * recall) / (precision + recall)
f1_score
np.float64(0.7169811320754716)
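
Because the harmonic mean is pulled toward the smaller of its two inputs, a high precision cannot hide a poor recall (or vice versa) in the f1-score.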

Calculating evaluation metrics by hand and with sklearn

# Compute each metric directly from the confusion-matrix counts
data = {}
data["accuracy"] = [(TP + TN) / (TN + FP + FN + TP)]
data["error"] = [(FP + FN) / (TN + FP + FN + TP)]
data["precision"] = [TP / (TP + FP)]
data["recall"] = [TP / (TP + FN)]
data["f1 score"] = [(2 * precision * recall) / (precision + recall)]
measures_df = pd.DataFrame(data, index=['ourselves'])
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Now the same metrics computed with sklearn's functions
pred_cv = pipe_tree.predict(X_valid)

data["accuracy"].append(accuracy_score(y_valid, pred_cv))
data["error"].append(1 - accuracy_score(y_valid, pred_cv))
data["precision"].append(precision_score(y_valid, pred_cv, zero_division=1))
data["recall"].append(recall_score(y_valid, pred_cv))
data["f1 score"].append(f1_score(y_valid, pred_cv))

pd.DataFrame(data, index=['ourselves', 'sklearn'])
           accuracy     error  precision    recall  f1 score
ourselves  0.998997  0.001003   0.690909  0.745098  0.716981
sklearn    0.998997  0.001003   0.690909  0.745098  0.716981

Classification report

from sklearn.metrics import classification_report


pipe_tree.classes_
array([0, 1])


print(classification_report(y_valid, pipe_tree.predict(X_valid),
        target_names=["non-fraud", "fraud"]))
              precision    recall  f1-score   support

   non-fraud       1.00      1.00      1.00     59708
       fraud       0.69      0.75      0.72       102

    accuracy                           1.00     59810
   macro avg       0.85      0.87      0.86     59810
weighted avg       1.00      1.00      1.00     59810
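
The report shows two averages worth distinguishing: "macro avg" is the unweighted mean of the per-class scores, while "weighted avg" weights each class by its support. As an illustration (not part of the original notes), the macro-averaged precision can be reproduced directly, assuming the same y_valid and predictions:

from sklearn.metrics import precision_score

# Unweighted mean of the two per-class precisions, matching the "macro avg" row above
precision_score(y_valid, predictions, average="macro")

With such a severe class imbalance, the weighted average is dominated by the non-fraud class, which is why it looks nearly perfect even though the fraud-class precision is only 0.69.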

Let’s apply what we learned!