4.1. Exercises
True or False: Unbalanced Data
Balancing our Data in Action
Instructions:
Running a coding exercise for the first time could take a bit of time for everything to load. Be patient, it could take a few minutes.
When you see ____ in a coding exercise, replace it with what you assume to be the correct code. Run it and see if you obtain the desired output. Submit your code to validate if you were correct.
Make sure you remove the hash (#) symbol in the coding portions of this question. We have commented them so that the line won’t execute and you can test your code after each step.
Let’s bring back the Pokémon dataset that we’ve seen a few times.
After splitting and inspecting the target column we see that this dataset is fairly unbalanced.
In this case, our positive class is “legendary”. In the target column, a value of 1 represents a legendary Pokémon and 0 a non-legendary one.
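As a rough illustration of what "splitting and inspecting the target column" can look like, here is a minimal sketch. It does not use the course's Pokémon file: toy_df, its column names and the 8-to-92 class ratio are made up for the example, and stratifying the split is our own choice rather than something the exercise specifies.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Pokémon data (not the course dataset):
# 1 = legendary, 0 = non-legendary, with only 8 positives out of 100 rows.
toy_df = pd.DataFrame({
    "attack": range(100),
    "legendary": [1] * 8 + [0] * 92,
})

# Stratifying (our assumption) keeps the class ratio similar in both splits.
train_df, valid_df = train_test_split(
    toy_df, test_size=0.25, random_state=123, stratify=toy_df["legendary"]
)

# Share of each class in the training target column.
class_share = train_df["legendary"].value_counts(normalize=True)
print(class_share)
```

A heavily lopsided share like this is what "unbalanced" refers to in the rest of the exercise.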
Let’s see how our evaluation metrics differ when we balance the classes.
Tasks:
- Build a pipeline containing the column transformer and an SVC model with default hyperparameters. Fit this pipeline and name it pipe_unbalanced.
- Predict your values on the validation set and save them in an object named unbalanced_predicted.
- Using sklearn tools, print a classification report comparing the validation y labels to unbalanced_predicted. Set digits=3.
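The general shape of these steps can be sketched on synthetic data. This is not the exercise solution: the data comes from make_classification, and the column names, toy_preprocessor and the other toy_ names are hypothetical stand-ins for the column transformer and splits that the exercise provides.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, unbalanced data (assumption): roughly 8% positive examples.
X_arr, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.92, 0.08], random_state=0,
)
X = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hypothetical column transformer: scale every numeric column.
toy_preprocessor = make_column_transformer((StandardScaler(), list(X.columns)))

# Pipeline with an SVC left at its default hyperparameters.
toy_pipe = make_pipeline(toy_preprocessor, SVC())
toy_pipe.fit(X_train, y_train)

# Predict on the validation set and summarise per-class metrics.
toy_predicted = toy_pipe.predict(X_valid)
toy_report = classification_report(y_valid, toy_predicted, digits=3)
print(toy_report)
```

In the report, the row for class 1 is the one to watch, since overall accuracy can look high on unbalanced data even when the minority class is predicted poorly.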
Tasks:
- Next, build a pipeline containing the column transformer and an SVC model, but this time set class_weight="balanced" in the SVM classifier. Save this pipeline in an object called pipe_balanced and fit it on the training data.
- Predict values on the validation set using pipe_balanced and save them in an object named balanced_predicted.
- Print another classification report comparing the validation y labels to balanced_predicted.
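To see what class_weight="balanced" does, the sketch below prints the weights scikit-learn derives from the class frequencies, n_samples / (n_classes * count of each class), and then fits a weighted SVC. As before, the data is synthetic and the demo_ names are hypothetical; whether recall on the minority class actually improves depends on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, unbalanced data (assumption): roughly 8% positive examples.
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.92, 0.08], random_state=0,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Weights implied by "balanced": the rarer class gets the larger weight.
demo_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
print(dict(zip([0, 1], demo_weights)))

# Same kind of pipeline as before, but errors on the rare class cost more.
demo_pipe = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
demo_pipe.fit(X_train, y_train)
demo_predicted = demo_pipe.predict(X_valid)
print(classification_report(y_valid, demo_predicted, digits=3))
```

Comparing this report with one from an unweighted model typically shows a trade-off: recall for the minority class tends to rise while its precision may fall.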