Baselines: Training a Model using Scikit-learn

Supervised Learning (Reminder)

  • Tabular data → Machine learning algorithm → ML model → new examples → predictions

Building the simplest machine learning model using sklearn



Baseline model: a simple model to compare your actual machine learning model against.

  • Most frequent baseline: always predicts the most frequent label in the training set.
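
To make this concrete, here is a minimal sketch of what the most frequent baseline does under the hood. (The list y_toy is a made-up example, not the course dataset.)

from collections import Counter

# Toy labels, just to illustrate the idea
y_toy = ["A+", "not A+", "not A+", "A+", "not A+"]

# The most frequent baseline finds the majority label in the training set ...
most_frequent_label = Counter(y_toy).most_common(1)[0][0]  # -> "not A+"

# ... and predicts that same label for every example.
predictions = [most_frequent_label] * len(y_toy)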

Data

import pandas as pd

classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
classification_df.head()
ml_experience class_attendance lab1 lab2 lab3 lab4 quiz1 quiz2
0 1 1 92 93 84 91 92 A+
1 1 0 94 90 80 83 91 not A+
2 0 0 78 85 83 80 80 not A+
3 0 1 91 94 92 91 89 A+
4 0 1 77 83 90 92 85 A+

1. Create 𝑋 and 𝑦

𝑋 → Feature vectors (all columns except the target)
𝑦 → Target (the column we want to predict; here, quiz2)

X = classification_df.drop(columns=["quiz2"])
y = classification_df["quiz2"]
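
A quick sanity check (optional): 𝑋 should have one row per example and one column per feature, and 𝑦 should have one entry per example.

X.shape  # (21, 7): 21 examples, 7 features
y.shape  # (21,): one label per example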

2. Create a classifier object

  • Import the appropriate classifier.
  • Create an object of the classifier.
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
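
As an aside, DummyClassifier supports other baseline strategies besides "most_frequent"; a couple of them are sketched below (we stick with "most_frequent" in these notes).

# "stratified": predict labels at random, following the training label proportions
# "uniform": predict labels uniformly at random
DummyClassifier(strategy="stratified", random_state=42)
DummyClassifier(strategy="uniform", random_state=42)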

3. Fit the classifier

dummy_clf.fit(X, y)
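
After fitting, the classifier has stored whatever it needs to make predictions. A small sketch of inspecting the fitted object (classes_ is the standard scikit-learn attribute listing the labels seen during training):

dummy_clf.classes_  # unique labels seen during fit, e.g. array(['A+', 'not A+'], ...)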

4. Predict the target of given examples

We can predict the target of examples by calling predict on the classifier object.

Let’s see what it predicts for a single observation first:

single_obs = X.loc[[0]]
single_obs
ml_experience class_attendance lab1 lab2 lab3 lab4 quiz1
0 1 1 92 93 84 91 92


dummy_clf.predict(single_obs)
array(['not A+'], dtype='<U6')
X
ml_experience class_attendance lab1 lab2 lab3 lab4 quiz1
0 1 1 92 93 84 91 92
1 1 0 94 90 80 83 91
2 0 0 78 85 83 80 80
... ... ... ... ... ... ... ...
18 1 1 91 93 90 88 82
19 0 1 77 94 87 81 89
20 1 1 96 92 92 96 87

21 rows × 7 columns


dummy_clf.predict(X)
array(['not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+'], dtype='<U6')

5. Scoring your model

In the classification setting, .score() gives the accuracy of the model, i.e., the proportion of correctly predicted observations.

Sometimes you will also see people reporting error, which is usually 1 − accuracy.

print("The accuracy of the model on the training data:", (round(dummy_clf.score(X, y), 3)))
The accuracy of the model on the training data: 0.524


print("The error of the model on the training data:", (round(1 - dummy_clf.score(X, y), 3)))
The error of the model on the training data: 0.476
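
To confirm that .score() is just the proportion of correct predictions, we can compute the accuracy by hand (a quick sketch; the result should match the score above):

# Compare predictions to the true labels; the mean of the boolean comparison is the accuracy.
(dummy_clf.predict(X) == y).mean()  # ≈ 0.524, same as dummy_clf.score(X, y)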

fit and predict paradigms

The general pattern when we build ML models using sklearn (a compact end-to-end sketch follows the list):

  1. Creating your 𝑋 and 𝑦 objects
  2. clf = DummyClassifier() → create a model (here we are naming it clf)
  3. clf.fit(X, y) → train the model
  4. clf.score(X, y) → assess the model
  5. clf.predict(Xnew) → predict on some new data using the trained model
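
Putting the five steps together, a minimal end-to-end sketch using the same toy data (X_new stands for a hypothetical DataFrame of new examples with the same columns as 𝑋):

from sklearn.dummy import DummyClassifier

# 1. X and y were created in the Data section above
clf = DummyClassifier(strategy="most_frequent")  # 2. create a model
clf.fit(X, y)                                    # 3. train the model
clf.score(X, y)                                  # 4. assess the model
# clf.predict(X_new)                             # 5. predict on new data (X_new: hypothetical new examples)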

Let’s apply what we learned!