Generalization

Visualizing model complexity using decision boundaries

import pandas as pd

classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
classification_df
    ml_experience  class_attendance  lab1  lab2  lab3  lab4  quiz1   quiz2
0               1                 1    92    93    84    91     92      A+
1               1                 0    94    90    80    83     91  not A+
2               0                 0    78    85    83    80     80  not A+
..            ...               ...   ...   ...   ...   ...    ...     ...
18              1                 1    91    93    90    88     82  not A+
19              0                 1    77    94    87    81     89  not A+
20              1                 1    96    92    92    96     87      A+

21 rows × 8 columns

X = classification_df.drop(columns=["quiz2"])
y = classification_df["quiz2"]


X_subset = X[["lab4", "quiz1"]]
X_subset.head()
   lab4  quiz1
0    91     92
1    83     91
2    80     80
3    91     89
4    92     85

from sklearn.tree import DecisionTreeClassifier

depth = 1
model = DecisionTreeClassifier(max_depth=depth)
model.fit(X_subset, y)
model.score(X_subset, y)
0.7142857142857143


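The decision boundary behind each of these scores can be visualized by evaluating the fitted tree on a dense grid over the two features. A minimal self-contained sketch (the synthetic lab4/quiz1 frame below is a stand-in for the course data, which may not be at hand):

```python
# Sketch: tracing a shallow tree's decision regions on a prediction grid.
# The synthetic lab4/quiz1 data here is an assumption standing in for
# the course CSV.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_subset = pd.DataFrame({"lab4": rng.integers(70, 100, 30),
                         "quiz1": rng.integers(70, 100, 30)})
y = np.where(X_subset["lab4"] + X_subset["quiz1"] > 170, "A+", "not A+")

model = DecisionTreeClassifier(max_depth=1).fit(X_subset, y)

# Predict on a dense grid covering the feature ranges; the grid
# predictions trace out the model's decision regions.
xx, yy = np.meshgrid(np.linspace(70, 100, 200), np.linspace(70, 100, 200))
grid = pd.DataFrame({"lab4": xx.ravel(), "quiz1": yy.ravel()})
Z = model.predict(grid).reshape(xx.shape)
print(np.unique(Z))
```

Mapping the grid predictions to 0/1 (e.g. `Z == "A+"`) and passing them to `plt.contourf(xx, yy, ...)` draws the boundary; with `max_depth=1` it is a single axis-aligned split.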

depth = 2
model = DecisionTreeClassifier(max_depth=depth)
model.fit(X_subset, y)
model.score(X_subset, y)
0.8095238095238095

depth = 4
model = DecisionTreeClassifier(max_depth=depth)
model.fit(X_subset, y)
model.score(X_subset, y)
0.9523809523809523
depth = 10
model = DecisionTreeClassifier(max_depth=depth)
model.fit(X_subset, y)
model.score(X_subset, y)
1.0
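With enough depth, the tree memorizes the training set exactly. The trend above can be reproduced in a short loop; a minimal sketch using a synthetic dataset (make_classification) in place of the course data:

```python
# Sketch: training accuracy as a function of max_depth.
# Synthetic data stands in for the course CSV.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=21, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

scores = []
for depth in [1, 2, 4, 10]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X, y)
    scores.append(model.score(X, y))

# Deeper trees only refine the splits of shallower ones, so the
# training score is non-decreasing in max_depth.
print(scores)
```

This training score climbing toward 1.0 says nothing about how the model does on new data, which is the question the rest of this section turns to.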

Fundamental goal of machine learning

The fundamental goal of machine learning is to generalize beyond the examples the model was trained on.

Generalizing to unseen data

A model that merely memorizes its training data is of little use; what matters is how well it predicts on data it has never seen.

Training score versus Generalization score

Given a model in machine learning, people usually talk about two kinds of accuracy (score):

  1. Accuracy on the training data (the training score)

  2. Accuracy on the entire distribution of data (the generalization score)

We only ever see a sample of that distribution, so in practice the generalization score is estimated on data the model did not see during training.
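A standard way to estimate the second quantity is to hold out part of the data as a test set. A minimal sketch, again with synthetic data (make_classification) in place of the course CSV:

```python
# Sketch: training score versus an estimate of the generalization score.
# Synthetic data stands in for the course CSV.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)  # accuracy on data the tree has seen
test_acc = model.score(X_test, y_test)     # proxy for accuracy on the distribution
print("train:", train_acc)
print("test:", test_acc)
```

An unrestricted tree memorizes the training set, so the training score is a flattering overestimate; the gap between the two numbers is a first diagnostic for overfitting.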

Let’s apply what we learned!