Imbalanced Datasets

Class imbalance in training sets

X_train.head()
Time V1 V2 V3 ... V26 V27 V28 Amount
121775 76314.0 1.505415 -0.546326 -0.518913 ... 0.022535 -0.017247 -0.010005 20.00
128746 78823.0 -0.735559 0.459686 2.093094 ... -0.597624 -0.285776 -0.258780 12.99
59776 48998.0 1.217941 0.783337 -0.070014 ... -0.352676 0.041878 0.057699 1.00
282774 171138.0 2.024211 -0.586693 -2.554675 ... 0.214823 -0.008147 -0.068130 10.00
268042 163035.0 -0.151161 1.067465 -0.771064 ... -0.209491 0.213933 0.233276 36.26

5 rows × 30 columns


y_train.value_counts(normalize=True)
Class
0    0.998237
1    0.001763
Name: proportion, dtype: float64

Addressing class imbalance

A very important question to ask yourself:
“Why do I have a class imbalance?”

  • Is it because one class is much rarer than the other?

  • Is it because of my data collection methods?

If you answer “no” to both of these questions, it may be fine to simply ignore the class imbalance.

Handling imbalance

There are two common approaches to this:

  1. Changing the training procedure

  2. Changing the data (not covered in this course; see the sketch after this list)

    • Undersampling
    • Oversampling
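
Data-level approaches are not covered in this course, but for context, here is a minimal sketch using the third-party imbalanced-learn package (RandomOverSampler and RandomUnderSampler are that package's names; X_train and y_train are the same objects as above):

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Oversampling: randomly duplicate minority-class rows until the classes are balanced.
X_over, y_over = RandomOverSampler(random_state=7).fit_resample(X_train, y_train)

# Undersampling: randomly drop majority-class rows until the classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=7).fit_resample(X_train, y_train)

Either way, resampling is applied only to the training split; the validation and test data keep their original class distribution.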

Changing the training procedure: class_weight

From the scikit-learn documentation for SVC:

`class_weight: dict or 'balanced', default=None`

Set the parameter C of class i to class_weight[i] * C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
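
As a concrete illustration of the “balanced” formula quoted above, here is a small sketch on made-up labels (the 98/2 split is purely for illustration; compute_class_weight is scikit-learn's utility for this computation):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels, not the credit card data: 98 examples of class 0, 2 of class 1.
y_toy = np.array([0] * 98 + [1] * 2)

# "balanced": weight for class i = n_samples / (n_classes * count_i)
print(len(y_toy) / (2 * np.bincount(y_toy)))    # [ 0.51020408 25.        ]

# The same computation via scikit-learn's helper:
print(compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy))

The rare class gets a much larger weight, so mistakes on it are penalized much more heavily during training.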
from sklearn.tree import DecisionTreeClassifier

# Baseline: an unweighted decision tree.
tree_default = DecisionTreeClassifier(random_state=7)
tree_default.fit(X_train, y_train);


# Weight errors on class 1 (the rare class) 100x; class 0 keeps the default weight of 1.
tree_100 = DecisionTreeClassifier(random_state=7, class_weight={1: 100})
tree_100.fit(X_train, y_train);

class_weight="balanced"

# "balanced" reweights classes inversely proportional to their frequencies.
tree_balanced = DecisionTreeClassifier(random_state=7, class_weight="balanced")
tree_balanced.fit(X_train, y_train);

Are we doing better with class_weight="balanced"?

tree_default.score(X_valid, y_valid)
0.9989968232737001


tree_balanced.score(X_valid, y_valid)
0.99879618792844

Validation accuracy is actually slightly lower with class_weight="balanced". But with roughly 99.8% of examples in class 0, accuracy is a poor way to compare these models: a classifier that predicts class 0 for everything would score almost as well.
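
Accuracy alone cannot tell us whether the weighted tree actually catches more fraud. A minimal sketch of a class-specific check (recall on class 1 is an illustrative choice here, not a prescribed metric; it reuses the fitted trees and validation split from above):

from sklearn.metrics import recall_score

# Fraction of actual fraud cases (class 1) that each model identifies.
print(recall_score(y_valid, tree_default.predict(X_valid)))
print(recall_score(y_valid, tree_balanced.predict(X_valid)))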

Stratified Splits

[Images unavailable: illustrations of stratified splitting, from the scikit-learn documentation.]
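
A minimal, self-contained sketch of stratified splitting (the make_classification toy data below just stands in for the credit card dataset; stratify= and StratifiedKFold are the relevant scikit-learn options):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy data with a roughly 98%/2% class split, standing in for the real dataset.
X, y = make_classification(n_samples=2000, weights=[0.98, 0.02], random_state=123)

# stratify=y keeps the class proportions (nearly) identical in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)
print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_va) / len(y_va))

# StratifiedKFold does the same thing fold by fold during cross-validation.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
for train_idx, valid_idx in skf.split(X_tr, y_tr):
    pass  # each fold keeps roughly the same fraction of class 1 as y_tr

The printed proportions should come out nearly identical for the two splits.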

Is stratifying a good idea?

Yes and no:

  • No: the result is no longer a purely random sample of the data.
  • Yes: it helps ensure the rare class appears in every split, and it can be especially useful in multi-class situations.

But in general, these are difficult questions to answer.

Let’s apply what we learned!