import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/canada_usa_cities.csv")
X = df.drop(columns=["country"])
y = df["country"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

score_train is our training score (or mean train score from cross-validation).
score_valid is our validation score (or mean validation score from cross-validation).
score_test is our test score.
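As a minimal sketch of how these three scores come about, here is the full pattern on a synthetic stand-in for the cities data (the random features and labels below are hypothetical, used only so the example runs on its own):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the cities data, for illustration only
rng = np.random.RandomState(123)
X = rng.rand(200, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

model = DecisionTreeClassifier(max_depth=3, random_state=123)
scores = cross_validate(model, X_train, y_train, cv=10, return_train_score=True)

score_train = scores["train_score"].mean()  # mean train score across folds
score_valid = scores["test_score"].mean()   # mean validation score across folds

model.fit(X_train, y_train)
score_test = model.score(X_test, y_test)    # computed once, at the very end
```

Note that the test set is touched only in the last line, after all model decisions have been made.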
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
results_dict = {"depth": list(), "mean_train_score": list(), "mean_cv_score": list()}

for depth in range(1, 20):
    model = DecisionTreeClassifier(max_depth=depth)
    scores = cross_validate(model, X_train, y_train, cv=10, return_train_score=True)
    results_dict["depth"].append(depth)
    results_dict["mean_cv_score"].append(scores["test_score"].mean())
    results_dict["mean_train_score"].append(scores["train_score"].mean())
results_df = pd.DataFrame(results_dict)

depth               5.000000
mean_train_score    0.918848
mean_cv_score       0.845956
Name: 4, dtype: float64
best_depth = int(results_df.sort_values('mean_cv_score', ascending=False).iloc[0]['depth'])
best_depth

5
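An equivalent way to pick the best depth is idxmax on the cross-validation score column, which avoids sorting the whole frame. The small results table below is hypothetical, shaped like results_df above:

```python
import pandas as pd

# Hypothetical results with the same columns as results_df above
results_df = pd.DataFrame({
    "depth": [1, 2, 3, 4, 5],
    "mean_train_score": [0.70, 0.80, 0.87, 0.90, 0.92],
    "mean_cv_score": [0.68, 0.75, 0.81, 0.84, 0.85],
})

# Row label of the highest mean CV score, then look up its depth
best_depth = int(results_df.loc[results_df["mean_cv_score"].idxmax(), "depth"])
# → 5 for this table
```

Ties are broken in favour of the first (smallest) depth, which is usually what we want: the simpler model.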
Even though we care most about the test score:
THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY
Split X and y into X_train, X_test, y_train, y_test (or train_df and test_df) using train_test_split.
Use cross_validate with return_train_score=True so that we have access to the training scores in each fold (if we want to plot train vs. validation error, for instance).
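Such a train vs. validation plot can be sketched as follows. The scores here are synthetic curves standing in for the mean_train_score and mean_cv_score columns computed above (the shapes are hypothetical, chosen only to illustrate the typical pattern of training score rising while validation score peaks and then flattens or drops):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

# Hypothetical scores, shaped like the results_df built above
depths = np.arange(1, 20)
results_df = pd.DataFrame({
    "depth": depths,
    "mean_train_score": 1 - 0.5 * np.exp(-0.2 * depths),
    "mean_cv_score": 0.85 - 0.001 * (depths - 5) ** 2,
})

fig, ax = plt.subplots()
ax.plot(results_df["depth"], results_df["mean_train_score"], label="mean train score")
ax.plot(results_df["depth"], results_df["mean_cv_score"], label="mean CV score")
ax.set_xlabel("max_depth")
ax.set_ylabel("score")
ax.legend()
fig.savefig("train_vs_cv.png")
```

The growing gap between the two curves at larger depths is the visual signature of overfitting, which is exactly why we pick the depth where the CV curve peaks rather than where the train curve does.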