Generalization

Visualizing model complexity using decision boundaries

import pandas as pd

classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
classification_df
    ml_experience  class_attendance  lab1  lab2  lab3  lab4  quiz1   quiz2
0               1                 1    92    93    84    91     92      A+
1               1                 0    94    90    80    83     91  not A+
2               0                 0    78    85    83    80     80  not A+
..            ...               ...   ...   ...   ...   ...    ...     ...
18              1                 1    91    93    90    88     82  not A+
19              0                 1    77    94    87    81     89  not A+
20              1                 1    96    92    92    96     87      A+

21 rows × 8 columns

X = classification_df.drop(columns=["quiz2"])
y = classification_df["quiz2"]


X_subset = X[["lab4", "quiz1"]]
X_subset.head()
   lab4  quiz1
0    91     92
1    83     91
2    80     80
3    91     89
4    92     85

from sklearn.tree import DecisionTreeClassifier

depth = 1
model = DecisionTreeClassifier(max_depth=depth)
model.fit(X_subset, y)
model.score(X_subset, y)
0.7142857142857143


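The decision boundary behind each of these scores can be visualized by evaluating the fitted tree on a dense grid over the two features. A minimal self-contained sketch (the synthetic lab4/quiz1 frame below is a stand-in for the course data, which may not be at hand):

```python
# Sketch: tracing a shallow tree's decision regions on a prediction grid.
# The synthetic lab4/quiz1 data here is an assumption standing in for
# the course CSV.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_subset = pd.DataFrame({"lab4": rng.integers(70, 100, 30),
                         "quiz1": rng.integers(70, 100, 30)})
y = np.where(X_subset["lab4"] + X_subset["quiz1"] > 170, "A+", "not A+")

model = DecisionTreeClassifier(max_depth=1).fit(X_subset, y)

# Predict on a dense grid covering the feature ranges; the grid
# predictions trace out the model's decision regions.
xx, yy = np.meshgrid(np.linspace(70, 100, 200), np.linspace(70, 100, 200))
grid = pd.DataFrame({"lab4": xx.ravel(), "quiz1": yy.ravel()})
Z = model.predict(grid).reshape(xx.shape)
print(np.unique(Z))
```

Mapping the grid predictions to 0/1 (e.g. `Z == "A+"`) and passing them to `plt.contourf(xx, yy, ...)` draws the boundary; with `max_depth=1` it is a single axis-aligned split.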

depth = 2
model = DecisionTreeClassifier(max_depth=depth)
model.fit(X_subset, y)
model.score(X_subset, y)
0.8095238095238095

depth = 4
model = DecisionTreeClassifier(max_depth=depth)
model.fit(X_subset, y)
model.score(X_subset, y)
0.9523809523809523
depth = 10
model = DecisionTreeClassifier(max_depth=depth)
model.fit(X_subset, y)
model.score(X_subset, y)
1.0
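With enough depth, the tree memorizes the training set exactly. The trend above can be reproduced in a short loop; a minimal sketch using a synthetic dataset (make_classification) in place of the course data:

```python
# Sketch: training accuracy as a function of max_depth.
# Synthetic data stands in for the course CSV.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=21, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

scores = []
for depth in [1, 2, 4, 10]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X, y)
    scores.append(model.score(X, y))

# Deeper trees only refine the splits of shallower ones, so the
# training score is non-decreasing in max_depth.
print(scores)
```

This training score climbing toward 1.0 says nothing about how the model does on new data, which is the question the rest of this section turns to.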

Fundamental goal of machine learning

The fundamental goal of machine learning is to generalize beyond the examples the model was trained on.

Generalizing to unseen data

A model that merely memorizes its training data is of little use; what matters is how well it predicts on data it has never seen.

Training score versus Generalization score

Given a model in machine learning, people usually talk about two kinds of accuracy (score):

  1. Accuracy on the training data (the training score)

  2. Accuracy on the entire distribution of data (the generalization score)

We only ever see a sample of that distribution, so in practice the generalization score is estimated on data the model did not see during training.
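A standard way to estimate the second quantity is to hold out part of the data as a test set. A minimal sketch, again with synthetic data (make_classification) in place of the course CSV:

```python
# Sketch: training score versus an estimate of the generalization score.
# Synthetic data stands in for the course CSV.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)  # accuracy on data the tree has seen
test_acc = model.score(X_test, y_test)     # proxy for accuracy on the distribution
print("train:", train_acc)
print("test:", test_acc)
```

An unrestricted tree memorizes the training set, so the training score is a flattering overestimate; the gap between the two numbers is a first diagnostic for overfitting.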

Let’s apply what we learned!