Baselines: Training a Model using Scikit-learn

Supervised Learning (Reminder)

  • Tabular data → Machine learning algorithm → ML model → new examples → predictions

Building the simplest machine learning model using sklearn



Baseline model: a simple model to compare your actual machine learning model against.

  • Most frequent baseline: always predicts the most frequent label in the training set.
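
To make this concrete, here is a minimal sketch of what the most frequent baseline does under the hood. (The list y_toy is a made-up example, not the course dataset.)

from collections import Counter

# Toy labels, just to illustrate the idea
y_toy = ["A+", "not A+", "not A+", "A+", "not A+"]

# The most frequent baseline finds the majority label in the training set ...
most_frequent_label = Counter(y_toy).most_common(1)[0][0]  # -> "not A+"

# ... and predicts that same label for every example.
predictions = [most_frequent_label] * len(y_toy)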

Data

import pandas as pd

classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
classification_df.head()
ml_experience class_attendance lab1 lab2 lab3 lab4 quiz1 quiz2
0 1 1 92 93 84 91 92 A+
1 1 0 94 90 80 83 91 not A+
2 0 0 78 85 83 80 80 not A+
3 0 1 91 94 92 91 89 A+
4 0 1 77 83 90 92 85 A+

1. Create 𝑋 and 𝑦

𝑋 → Feature vectors (all columns except the target)
𝑦 → Target (the column we want to predict; here, quiz2)

X = classification_df.drop(columns=["quiz2"])
y = classification_df["quiz2"]
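
A quick sanity check (optional): 𝑋 should have one row per example and one column per feature, and 𝑦 should have one entry per example.

X.shape  # (21, 7): 21 examples, 7 features
y.shape  # (21,): one label per example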

2. Create a classifier object

  • Import the appropriate classifier.
  • Create an object of the classifier.
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
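
As an aside, DummyClassifier supports other baseline strategies besides "most_frequent"; a couple of them are sketched below (we stick with "most_frequent" in these notes).

# "stratified": predict labels at random, following the training label proportions
# "uniform": predict labels uniformly at random
DummyClassifier(strategy="stratified", random_state=42)
DummyClassifier(strategy="uniform", random_state=42)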

3. Fit the classifier

dummy_clf.fit(X, y)
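
After fitting, the classifier has stored whatever it needs to make predictions. A small sketch of inspecting the fitted object (classes_ is the standard scikit-learn attribute listing the labels seen during training):

dummy_clf.classes_  # unique labels seen during fit, e.g. array(['A+', 'not A+'], ...)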

4. Predict the target of given examples

We can predict the target of examples by calling predict on the classifier object.

Let’s see what it predicts for a single observation first:

single_obs = X.loc[[0]]
single_obs
ml_experience class_attendance lab1 lab2 lab3 lab4 quiz1
0 1 1 92 93 84 91 92


dummy_clf.predict(single_obs)
array(['not A+'], dtype='<U6')
X
ml_experience class_attendance lab1 lab2 lab3 lab4 quiz1
0 1 1 92 93 84 91 92
1 1 0 94 90 80 83 91
2 0 0 78 85 83 80 80
... ... ... ... ... ... ... ...
18 1 1 91 93 90 88 82
19 0 1 77 94 87 81 89
20 1 1 96 92 92 96 87

21 rows × 7 columns


dummy_clf.predict(X)
array(['not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+'], dtype='<U6')

5. Scoring your model

In the classification setting, .score() gives the accuracy of the model, i.e., the proportion of correctly predicted observations.

Sometimes you will also see people reporting error, which is usually 1 − accuracy.

print("The accuracy of the model on the training data:", (round(dummy_clf.score(X, y), 3)))
The accuracy of the model on the training data: 0.524


print("The error of the model on the training data:", (round(1 - dummy_clf.score(X, y), 3)))
The error of the model on the training data: 0.476
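
To confirm that .score() is just the proportion of correct predictions, we can compute the accuracy by hand (a quick sketch; the result should match the score above):

# Compare predictions to the true labels; the mean of the boolean comparison is the accuracy.
(dummy_clf.predict(X) == y).mean()  # ≈ 0.524, same as dummy_clf.score(X, y)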

fit and predict paradigms

The general pattern when we build ML models using sklearn (a compact end-to-end sketch follows the list):

  1. Creating your 𝑋 and 𝑦 objects
  2. clf = DummyClassifier() → create a model (here we are naming it clf)
  3. clf.fit(X, y) → train the model
  4. clf.score(X, y) → assess the model
  5. clf.predict(Xnew) → predict on some new data using the trained model
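
Putting the five steps together, a minimal end-to-end sketch using the same toy data (X_new stands for a hypothetical DataFrame of new examples with the same columns as 𝑋):

from sklearn.dummy import DummyClassifier

# 1. X and y were created in the Data section above
clf = DummyClassifier(strategy="most_frequent")  # 2. create a model
clf.fit(X, y)                                    # 3. train the model
clf.score(X, y)                                  # 4. assess the model
# clf.predict(X_new)                             # 5. predict on new data (X_new: hypothetical new examples)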

Let’s apply what we learned!