Imbalanced Datasets

Class imbalance in training sets

X_train.head()
Time V1 V2 V3 ... V26 V27 V28 Amount
121775 76314.0 1.505415 -0.546326 -0.518913 ... 0.022535 -0.017247 -0.010005 20.00
128746 78823.0 -0.735559 0.459686 2.093094 ... -0.597624 -0.285776 -0.258780 12.99
59776 48998.0 1.217941 0.783337 -0.070014 ... -0.352676 0.041878 0.057699 1.00
282774 171138.0 2.024211 -0.586693 -2.554675 ... 0.214823 -0.008147 -0.068130 10.00
268042 163035.0 -0.151161 1.067465 -0.771064 ... -0.209491 0.213933 0.233276 36.26

5 rows × 30 columns


y_train.value_counts(normalize=True)
Class
0    0.998237
1    0.001763
Name: proportion, dtype: float64

Addressing class imbalance

A very important question to ask yourself:
“Why do I have a class imbalance?”

  • Is it because one class is much rarer than the other?

  • Is it because of my data collection methods?

If you answer “no” to both of these questions, it may be fine to simply ignore the class imbalance.

Handling imbalance

There are two common approaches to this:

  1. Changing the training procedure

  2. Changing the data (not covered in this course; see the sketch after this list)

    • Undersampling
    • Oversampling
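
Data-level approaches are not covered in this course, but for context, here is a minimal sketch using the third-party imbalanced-learn package (RandomOverSampler and RandomUnderSampler are that package's names; X_train and y_train are the same objects as above):

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Oversampling: randomly duplicate minority-class rows until the classes are balanced.
X_over, y_over = RandomOverSampler(random_state=7).fit_resample(X_train, y_train)

# Undersampling: randomly drop majority-class rows until the classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=7).fit_resample(X_train, y_train)

Either way, resampling is applied only to the training split; the validation and test data keep their original class distribution.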

Changing the training procedure: class_weight

From the scikit-learn documentation for SVC:

`class_weight: dict or 'balanced', default=None`

Set the parameter C of class i to class_weight[i] * C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
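
As a concrete illustration of the “balanced” formula quoted above, here is a small sketch on made-up labels (the 98/2 split is purely for illustration; compute_class_weight is scikit-learn's utility for this computation):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels, not the credit card data: 98 examples of class 0, 2 of class 1.
y_toy = np.array([0] * 98 + [1] * 2)

# "balanced": weight for class i = n_samples / (n_classes * count_i)
print(len(y_toy) / (2 * np.bincount(y_toy)))    # [ 0.51020408 25.        ]

# The same computation via scikit-learn's helper:
print(compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy))

The rare class gets a much larger weight, so mistakes on it are penalized much more heavily during training.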
from sklearn.tree import DecisionTreeClassifier

# Baseline: an unweighted decision tree.
tree_default = DecisionTreeClassifier(random_state=7)
tree_default.fit(X_train, y_train);


# Weight errors on class 1 (the rare class) 100x; class 0 keeps the default weight of 1.
tree_100 = DecisionTreeClassifier(random_state=7, class_weight={1: 100})
tree_100.fit(X_train, y_train);

class_weight="balanced"

# "balanced" reweights classes inversely proportional to their frequencies.
tree_balanced = DecisionTreeClassifier(random_state=7, class_weight="balanced")
tree_balanced.fit(X_train, y_train);

Are we doing better with class_weight="balanced"?

tree_default.score(X_valid, y_valid)
0.9989968232737001


tree_balanced.score(X_valid, y_valid)
0.99879618792844

Validation accuracy is actually slightly lower with class_weight="balanced". But with roughly 99.8% of examples in class 0, accuracy is a poor way to compare these models: a classifier that predicts class 0 for everything would score almost as well.
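
Accuracy alone cannot tell us whether the weighted tree actually catches more fraud. A minimal sketch of a class-specific check (recall on class 1 is an illustrative choice here, not a prescribed metric; it reuses the fitted trees and validation split from above):

from sklearn.metrics import recall_score

# Fraction of actual fraud cases (class 1) that each model identifies.
print(recall_score(y_valid, tree_default.predict(X_valid)))
print(recall_score(y_valid, tree_balanced.predict(X_valid)))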

Stratified Splits

[Images unavailable: illustrations of stratified splitting, from the scikit-learn documentation.]
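
A minimal, self-contained sketch of stratified splitting (the make_classification toy data below just stands in for the credit card dataset; stratify= and StratifiedKFold are the relevant scikit-learn options):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy data with a roughly 98%/2% class split, standing in for the real dataset.
X, y = make_classification(n_samples=2000, weights=[0.98, 0.02], random_state=123)

# stratify=y keeps the class proportions (nearly) identical in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)
print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_va) / len(y_va))

# StratifiedKFold does the same thing fold by fold during cross-validation.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
for train_idx, valid_idx in skf.split(X_tr, y_tr):
    pass  # each fold keeps roughly the same fraction of class 1 as y_tr

The printed proportions should come out nearly identical for the two splits.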

Is stratifying a good idea?

Yes and no:

  • No: the result is no longer a purely random sample of the data.
  • Yes: it helps ensure the rare class appears in every split, and it can be especially useful in multi-class situations.

But in general, these are difficult questions to answer.

Let’s apply what we learned!