Decision trees with continuous features

import pandas as pd

classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
classification_df.head(3)
   ml_experience  class_attendance  lab1  lab2  lab3  lab4  quiz1   quiz2
0              1                 1    92    93    84    91     92      A+
1              1                 0    94    90    80    83     91  not A+
2              0                 0    78    85    83    80     80  not A+


X = classification_df.drop(columns=["quiz2"])
X.head(3)
   ml_experience  class_attendance  lab1  lab2  lab3  lab4  quiz1
0              1                 1    92    93    84    91     92
1              1                 0    94    90    80    83     91
2              0                 0    78    85    83    80     80


y = classification_df["quiz2"]
y.head(3)
0        A+
1    not A+
2    not A+
Name: quiz2, dtype: object

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X, y)
X_subset = X[["lab4", "quiz1"]]
X_subset.head()
   lab4  quiz1
0    91     92
1    83     91
2    80     80
3    91     89
4    92     85

Decision boundaries

depth = 1
model = DecisionTreeClassifier(max_depth=depth)
model.fit(X_subset, y)
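
With max_depth=1 the tree makes a single split: one continuous feature compared against one threshold. The sketch below illustrates the general idea of how such a threshold can be chosen. It tries the midpoints between consecutive sorted values and keeps the one with the lowest weighted Gini impurity. It is a simplified illustration in plain Python, not scikit-learn's actual implementation. The helper names gini and best_threshold are hypothetical, and the data are only the three lab4 and quiz2 rows shown earlier.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best split `value <= threshold`.

    Hypothetical helper: candidate thresholds are the midpoints between
    consecutive distinct feature values.
    """
    unique_vals = sorted(set(values))
    best = (None, float("inf"))
    for lo, hi in zip(unique_vals, unique_vals[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        # Weight each side's impurity by the share of examples it receives.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# The first three rows of lab4 and quiz2 from the toy dataset above.
lab4 = [91, 83, 80]
quiz2 = ["A+", "not A+", "not A+"]

threshold, impurity = best_threshold(lab4, quiz2)
print(threshold, impurity)  # 87.0 0.0
```

On these three rows the rule lab4 <= 87.0 separates the two classes perfectly. In the two-feature plot this corresponds to a straight decision boundary perpendicular to the lab4 axis.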

Another example of decision boundaries

import altair as alt

df = pd.read_csv('data/canada_usa_cities.csv')
df.head()
   longitude  latitude country
0  -130.0437   55.9773     USA
1  -134.4197   58.3019     USA
2  -123.0780   48.9854     USA
3  -122.7436   48.9881     USA
4  -122.2691   48.9951     USA

chart1 = alt.Chart(df).mark_circle(size=20, opacity=0.6).encode(
    alt.X('longitude:Q', scale=alt.Scale(domain=[-140, -40]), axis=alt.Axis(grid=False)),
    alt.Y('latitude:Q', scale=alt.Scale(domain=[20, 60]), axis=alt.Axis(grid=False)),
    alt.Color('country:N', scale=alt.Scale(domain=['Canada', 'USA'],
                                           range=['red', 'blue']))
)
chart1

Real boundary between Canada and USA

Attribution: sovereignlimits.com

Let’s apply what we learned!