4.1. Exercises

True or False: Unbalanced Data

Balancing our Data in Action

Instructions:
Running a coding exercise for the first time can take a while as everything loads. Be patient; it could take a few minutes.

When you see ____ in a coding exercise, replace it with what you assume to be the correct code. Run it and see if you obtain the desired output. Submit your code to validate if you were correct.

Make sure you remove the hash (#) symbol in the coding portions of this question. We have commented those lines out so that they won’t execute and you can test your code after each step.

Let’s bring back the Pokémon dataset that we’ve seen a few times.

After splitting the data and inspecting the target column, we see that this dataset is fairly unbalanced.

In this case, our positive label is “legendary”. In our dataset, a value of 1 represents a legendary Pokémon and 0 represents a non-legendary one.
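
One way to inspect how unbalanced a target column is can be sketched with pandas. The values below are a toy stand-in chosen for illustration, not the actual Pokémon counts, and the name y_train is an assumption about what the split produces.

```python
import pandas as pd

# Toy stand-in for the training target (assumption, not the real data):
# 1 = legendary, 0 = non-legendary.
y_train = pd.Series([0] * 92 + [1] * 8, name="legendary")

# Absolute count of each class
class_counts = y_train.value_counts()

# Share of each class; a very small share for class 1 signals imbalance
class_proportions = y_train.value_counts(normalize=True)

print(class_counts)
print(class_proportions)
```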

Let’s see how our measurements differ when we balance our datasets.

Tasks:

  • Build a pipeline containing the column transformer and an SVC model with default hyperparameters. Fit this pipeline and name it pipe_unbalanced.
  • Predict your values on the validation set and save them in an object named unbalanced_predicted.
  • Using sklearn tools, print a classification report comparing the validation y labels to unbalanced_predicted. Set digits=3.
Hint 1
  • Are you coding pipe_unbalanced as make_pipeline(preprocessor, SVC())?
  • Are you fitting on the training set?
  • Are you building a classification report with classification_report(y_valid, unbalanced_predicted, digits=3)?
Fully worked solution:
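
A minimal sketch of the three steps above, not the course's own solution code. It assumes a synthetic unbalanced dataset from make_classification as a stand-in for the Pokémon data, and a hypothetical preprocessor that simply scales every numeric column; the feature names and the split are likewise made up for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): about 8% of examples are class 1.
X_array, y = make_classification(
    n_samples=800, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.92], random_state=123,
)
feature_names = [f"feature_{i}" for i in range(X_array.shape[1])]
X = pd.DataFrame(X_array, columns=feature_names)

# Stratify so the validation set keeps the same class proportions.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

# Hypothetical column transformer: scale all (numeric) columns.
preprocessor = make_column_transformer((StandardScaler(), feature_names))

# Pipeline with an SVC using default hyperparameters, fit on the training set.
pipe_unbalanced = make_pipeline(preprocessor, SVC())
pipe_unbalanced.fit(X_train, y_train)

# Predict on the validation set and report per-class metrics.
unbalanced_predicted = pipe_unbalanced.predict(X_valid)
unbalanced_report = classification_report(y_valid, unbalanced_predicted, digits=3)
print(unbalanced_report)
```

In the exercise itself, X_train, X_valid, y_train, y_valid and preprocessor are already provided, so only the last three steps are needed.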


Tasks:

  • Next, build a pipeline containing the column transformer and an SVC model, this time setting class_weight="balanced" in the SVC. Name this pipeline pipe_balanced and fit it on the training data.
  • Predict values on the validation set using pipe_balanced and save them in an object named balanced_predicted.
  • Print another classification report comparing the validation y labels to balanced_predicted.
Hint 1
  • Are you building make_pipeline(preprocessor, SVC(class_weight="balanced")) and fitting it?
  • Are you predicting the values from the balanced pipeline using pipe_balanced.predict(X_valid) and naming it balanced_predicted?
  • Are you building a classification report with classification_report(y_valid, balanced_predicted, digits=2)?
Fully worked solution:
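
A minimal sketch of the balanced variant, again not the course's own solution code. It assumes the same kind of synthetic stand-in data and hypothetical scaling-only preprocessor as a replacement for the Pokémon dataset and its column transformer; the only change to the model is class_weight="balanced".

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): about 8% of examples are class 1.
X_array, y = make_classification(
    n_samples=800, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.92], random_state=123,
)
feature_names = [f"feature_{i}" for i in range(X_array.shape[1])]
X = pd.DataFrame(X_array, columns=feature_names)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

# Hypothetical column transformer: scale all (numeric) columns.
preprocessor = make_column_transformer((StandardScaler(), feature_names))

# class_weight="balanced" weights each class inversely to its frequency,
# so mistakes on the rare class are penalized more heavily during fitting.
pipe_balanced = make_pipeline(preprocessor, SVC(class_weight="balanced"))
pipe_balanced.fit(X_train, y_train)

# Predict on the validation set and report per-class metrics.
balanced_predicted = pipe_balanced.predict(X_valid)
balanced_report = classification_report(y_valid, balanced_predicted, digits=2)
print(balanced_report)
```

When comparing this report with the unbalanced one, the row for class 1 is the one to watch: balancing often raises recall for the rare class, usually at some cost in precision.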