Data Splitting

Recap

Training score versus generalization score

For a given model, in machine learning (ML) we usually talk about two kinds of scores (accuracies):

  1. Score on the training data

  2. Score on the entire distribution of data (the generalization score)

We cannot compute the score on the entire distribution directly, but we can approximate it by splitting our data!
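Here is a minimal sketch of the idea on a small synthetic dataset (the cities data below will make it concrete): the score on the held-out portion stands in for the score on the entire distribution.

# A minimal sketch (synthetic data, not the cities data used below):
# the held-out score approximates the unknowable score on the full distribution.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

toy_model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("Score on training data:", toy_model.score(X_tr, y_tr))
print("Score on held-out data:", toy_model.score(X_te, y_te))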


Simple train and test split


import pandas as pd

cities_df = pd.read_csv("data/canada_usa_cities.csv")
cities_df
longitude latitude country
0 -130.0437 55.9773 USA
1 -134.4197 58.3019 USA
2 -123.0780 48.9854 USA
... ... ... ...
206 -79.2506 42.9931 Canada
207 -72.9406 45.6275 Canada
208 -79.4608 46.3092 Canada

209 rows × 3 columns

X = cities_df.drop(columns=["country"])
X
longitude latitude
0 -130.0437 55.9773
1 -134.4197 58.3019
2 -123.0780 48.9854
... ... ...
206 -79.2506 42.9931
207 -72.9406 45.6275
208 -79.4608 46.3092

209 rows × 2 columns

y = cities_df["country"]
y
0         USA
1         USA
2         USA
        ...  
206    Canada
207    Canada
208    Canada
Name: country, Length: 209, dtype: object
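Aside (not part of the original walkthrough): a quick look at the class balance of the target can be useful before splitting, for example to decide whether a stratified split is warranted.

# How many Canadian vs. US cities are in the target?
y.value_counts()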
from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)
X_train.head(3)
longitude latitude
160 -76.4813 44.2307
127 -81.2496 42.9837
169 -66.0580 45.2788
X_test.head(3)
longitude latitude
172 -64.8001 46.0980
175 -82.4066 42.9746
181 -111.3885 56.7292


y_train.head(3)
160    Canada
127    Canada
169    Canada
Name: country, dtype: object


y_test.head(3)
172    Canada
175    Canada
181    Canada
Name: country, dtype: object
shape_dict = {"Data portion": ["X", "y", "X_train", "y_train", "X_test", "y_test"],
    "Shape": [X.shape, y.shape,
              X_train.shape, y_train.shape,
              X_test.shape, y_test.shape]}

shape_df = pd.DataFrame(shape_dict)
shape_df
Data portion Shape
0 X (209, 2)
1 y (209,)
2 X_train (167, 2)
3 y_train (167,)
4 X_test (42, 2)
5 y_test (42,)
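Aside: for classification problems with imbalanced classes, train_test_split also accepts a stratify argument that keeps the class proportions roughly equal in both portions. This is only a sketch; the rest of the section does not use it.

# Stratified split: the Canada/USA proportions in y_train and y_test
# roughly match the proportions in the full target y.
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)

y_train_strat.value_counts(normalize=True)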
Alternatively, we can split the full dataframe first and then separate the features and the target afterwards:

train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)

X_train, y_train = train_df.drop(columns=["country"]), train_df["country"]

X_test, y_test = test_df.drop(columns=["country"]), test_df["country"]

train_df.head()
longitude latitude country
160 -76.4813 44.2307 Canada
127 -81.2496 42.9837 Canada
169 -66.0580 45.2788 Canada
188 -73.2533 45.3057 Canada
187 -67.9245 47.1652 Canada
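A quick sanity check (an aside): because both calls to train_test_split used the same random_state and the same number of rows, the two approaches should select exactly the same rows for training.

# Recreate the first split of (X, y) and compare it to the dataframe split.
X_train_check, X_test_check, y_train_check, y_test_check = train_test_split(
    X, y, test_size=0.2, random_state=123)

print(X_train_check.equals(train_df.drop(columns=["country"])))  # expected: True
print(y_train_check.equals(train_df["country"]))                 # expected: True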
import altair as alt

chart_cities = alt.Chart(train_df).mark_circle(size=20, opacity=0.6).encode(
    alt.X('longitude:Q', scale=alt.Scale(domain=[-140, -40])),
    alt.Y('latitude:Q', scale=alt.Scale(domain=[20, 60])),
    alt.Color('country:N', scale=alt.Scale(domain=['Canada', 'USA'],
                                           range=['red', 'blue'])))
chart_cities
Scatter plot of the training cities, coloured by country.
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train);
print("Train score: " + str(round(model.score(X_train, y_train), 2)))
print("Test score: " + str(round(model.score(X_test, y_test), 2)))
Train score: 1.0
Test score: 0.74
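The perfect training score next to a noticeably lower test score suggests the tree is overfitting the training data. For classifiers, .score reports accuracy, so the test score above can also be computed explicitly (a sketch using sklearn.metrics):

from sklearn.metrics import accuracy_score

# .score on a classifier is the accuracy of its predictions
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))  # should match model.score(X_test, y_test)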

test_size and train_size arguments

The test_size argument controls what fraction of the rows end up in the test portion; here test_size=0.2 puts roughly 20% of the data in the test set (train_size can be used instead, as sketched below).

train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)
shape_dict2 = {"Data portion": ["cities_df", "train_df", "test_df"],
    "Shape": [cities_df.shape, train_df.shape,
              test_df.shape]}

shape_df2 = pd.DataFrame(shape_dict2)
shape_df2
Data portion Shape
0 cities_df (209, 3)
1 train_df (167, 3)
2 test_df (42, 3)
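The same split can be specified with train_size instead of test_size; a quick sketch (with this data and seed the resulting shapes should match the ones above):

# train_size=0.8 is the complement of test_size=0.2
train_df_80, test_df_20 = train_test_split(cities_df, train_size=0.8, random_state=123)
train_df_80.shape, test_df_20.shape  # expected: (167, 3) and (42, 3)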

random_state argument

The random_state argument fixes the otherwise random shuffle so that a split is reproducible; different values select different rows.

train_df_rs5, test_df_rs5 = train_test_split(cities_df, test_size=0.2, random_state=5)
train_df_rs7, test_df_rs7 = train_test_split(cities_df, test_size=0.2, random_state=7)


train_df_rs5.head(3)
longitude latitude country
39 -96.7969 32.7763 USA
55 -97.5171 35.4730 USA
40 -121.8906 37.3362 USA


train_df_rs7.head(3)
longitude latitude country
128 -118.7148 50.4165 Canada
195 -122.7454 53.9129 Canada
99 -72.0968 45.0072 Canada
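The point of random_state is reproducibility: re-running the split with the same value returns exactly the same rows, while different values give different splits. A quick check (an aside):

# Same seed -> identical split; different seeds -> (almost surely) different splits
train_df_rs5_again, test_df_rs5_again = train_test_split(
    cities_df, test_size=0.2, random_state=5)

print(train_df_rs5_again.equals(train_df_rs5))  # expected: True
print(train_df_rs5.equals(train_df_rs7))        # expected: False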

Let’s apply what we learned!