Data Splitting

Recap

Training score versus generalization score

For a given model, in machine learning (ML) we usually talk about two kinds of scores (accuracies):

  1. Score on the training data

  2. Score on the entire distribution of data (the generalization score)

We cannot compute the score on the entire distribution directly, but we can approximate it by splitting our data!
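Here is a minimal sketch of the idea on a small synthetic dataset (the cities data below will make it concrete): the score on the held-out portion stands in for the score on the entire distribution.

# A minimal sketch (synthetic data, not the cities data used below):
# the held-out score approximates the unknowable score on the full distribution.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

toy_model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("Score on training data:", toy_model.score(X_tr, y_tr))
print("Score on held-out data:", toy_model.score(X_te, y_te))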


Simple train and test split


import pandas as pd

cities_df = pd.read_csv("data/canada_usa_cities.csv")
cities_df
longitude latitude country
0 -130.0437 55.9773 USA
1 -134.4197 58.3019 USA
2 -123.0780 48.9854 USA
... ... ... ...
206 -79.2506 42.9931 Canada
207 -72.9406 45.6275 Canada
208 -79.4608 46.3092 Canada

209 rows × 3 columns

X = cities_df.drop(columns=["country"])
X
longitude latitude
0 -130.0437 55.9773
1 -134.4197 58.3019
2 -123.0780 48.9854
... ... ...
206 -79.2506 42.9931
207 -72.9406 45.6275
208 -79.4608 46.3092

209 rows × 2 columns

y = cities_df["country"]
y
0         USA
1         USA
2         USA
        ...  
206    Canada
207    Canada
208    Canada
Name: country, Length: 209, dtype: object
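Aside (not part of the original walkthrough): a quick look at the class balance of the target can be useful before splitting, for example to decide whether a stratified split is warranted.

# How many Canadian vs. US cities are in the target?
y.value_counts()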
from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)
X_train.head(3)
longitude latitude
160 -76.4813 44.2307
127 -81.2496 42.9837
169 -66.0580 45.2788
X_test.head(3)
longitude latitude
172 -64.8001 46.0980
175 -82.4066 42.9746
181 -111.3885 56.7292


y_train.head(3)
160    Canada
127    Canada
169    Canada
Name: country, dtype: object


y_test.head(3)
172    Canada
175    Canada
181    Canada
Name: country, dtype: object
shape_dict = {"Data portion": ["X", "y", "X_train", "y_train", "X_test", "y_test"],
    "Shape": [X.shape, y.shape,
              X_train.shape, y_train.shape,
              X_test.shape, y_test.shape]}

shape_df = pd.DataFrame(shape_dict)
shape_df
Data portion Shape
0 X (209, 2)
1 y (209,)
2 X_train (167, 2)
3 y_train (167,)
4 X_test (42, 2)
5 y_test (42,)
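Aside: for classification problems with imbalanced classes, train_test_split also accepts a stratify argument that keeps the class proportions roughly equal in both portions. This is only a sketch; the rest of the section does not use it.

# Stratified split: the Canada/USA proportions in y_train and y_test
# roughly match the proportions in the full target y.
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)

y_train_strat.value_counts(normalize=True)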
Alternatively, we can split the full dataframe first and then separate the features and the target afterwards:

train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)

X_train, y_train = train_df.drop(columns=["country"]), train_df["country"]

X_test, y_test = test_df.drop(columns=["country"]), test_df["country"]

train_df.head()
longitude latitude country
160 -76.4813 44.2307 Canada
127 -81.2496 42.9837 Canada
169 -66.0580 45.2788 Canada
188 -73.2533 45.3057 Canada
187 -67.9245 47.1652 Canada
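A quick sanity check (an aside): because both calls to train_test_split used the same random_state and the same number of rows, the two approaches should select exactly the same rows for training.

# Recreate the first split of (X, y) and compare it to the dataframe split.
X_train_check, X_test_check, y_train_check, y_test_check = train_test_split(
    X, y, test_size=0.2, random_state=123)

print(X_train_check.equals(train_df.drop(columns=["country"])))  # expected: True
print(y_train_check.equals(train_df["country"]))                 # expected: True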
import altair as alt

chart_cities = alt.Chart(train_df).mark_circle(size=20, opacity=0.6).encode(
    alt.X('longitude:Q', scale=alt.Scale(domain=[-140, -40])),
    alt.Y('latitude:Q', scale=alt.Scale(domain=[20, 60])),
    alt.Color('country:N', scale=alt.Scale(domain=['Canada', 'USA'],
                                           range=['red', 'blue'])))
chart_cities
Scatter plot of the training cities, coloured by country.
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train);
print("Train score: " + str(round(model.score(X_train, y_train), 2)))
print("Test score: " + str(round(model.score(X_test, y_test), 2)))
Train score: 1.0
Test score: 0.74
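The perfect training score next to a noticeably lower test score suggests the tree is overfitting the training data. For classifiers, .score reports accuracy, so the test score above can also be computed explicitly (a sketch using sklearn.metrics):

from sklearn.metrics import accuracy_score

# .score on a classifier is the accuracy of its predictions
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))  # should match model.score(X_test, y_test)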

test_size and train_size arguments

The test_size argument controls what fraction of the rows end up in the test portion; here test_size=0.2 puts roughly 20% of the data in the test set (train_size can be used instead, as sketched below).

train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)
shape_dict2 = {"Data portion": ["cities_df", "train_df", "test_df"],
    "Shape": [cities_df.shape, train_df.shape,
              test_df.shape]}

shape_df2 = pd.DataFrame(shape_dict2)
shape_df2
Data portion Shape
0 cities_df (209, 3)
1 train_df (167, 3)
2 test_df (42, 3)
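The same split can be specified with train_size instead of test_size; a quick sketch (with this data and seed the resulting shapes should match the ones above):

# train_size=0.8 is the complement of test_size=0.2
train_df_80, test_df_20 = train_test_split(cities_df, train_size=0.8, random_state=123)
train_df_80.shape, test_df_20.shape  # expected: (167, 3) and (42, 3)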

random_state argument

The random_state argument fixes the otherwise random shuffle so that a split is reproducible; different values select different rows.

train_df_rs5, test_df_rs5 = train_test_split(cities_df, test_size=0.2, random_state=5)
train_df_rs7, test_df_rs7 = train_test_split(cities_df, test_size=0.2, random_state=7)


train_df_rs5.head(3)
longitude latitude country
39 -96.7969 32.7763 USA
55 -97.5171 35.4730 USA
40 -121.8906 37.3362 USA


train_df_rs7.head(3)
longitude latitude country
128 -118.7148 50.4165 Canada
195 -122.7454 53.9129 Canada
99 -72.0968 45.0072 Canada
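The point of random_state is reproducibility: re-running the split with the same value returns exactly the same rows, while different values give different splits. A quick check (an aside):

# Same seed -> identical split; different seeds -> (almost surely) different splits
train_df_rs5_again, test_df_rs5_again = train_test_split(
    cities_df, test_size=0.2, random_state=5)

print(train_df_rs5_again.equals(train_df_rs5))  # expected: True
print(train_df_rs5.equals(train_df_rs7))        # expected: False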

Let’s apply what we learned!