4.1. Exercises

Probabilities and Logistic Regression

We are trying to predict if a job applicant would be hired based on some features contained in their resume.

Below is the output of .predict_proba(), where column 0 gives the predicted probability of the class “hired” and column 1 gives the predicted probability of the class “not hired”.

array([[0.04971843, 0.95028157],
       [0.94173513, 0.05826487],
       [0.74133975, 0.25866025],
       [0.13024982, 0.86975018],
       [0.17126403, 0.82873597]])

Use this output to answer the following questions.
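As a minimal sketch of how these probabilities relate to hard predictions: for a binary classifier, .predict() corresponds to picking the column with the larger probability in each row. The class order in `classes` below is an assumption taken from the description above (column 0 is “hired”, column 1 is “not hired”), and the variable names are illustrative.

```python
import numpy as np

# The predict_proba output shown above; each row sums to 1.
proba = np.array([[0.04971843, 0.95028157],
                  [0.94173513, 0.05826487],
                  [0.74133975, 0.25866025],
                  [0.13024982, 0.86975018],
                  [0.17126403, 0.82873597]])

# Assumed column order, following the text: column 0 = "hired", column 1 = "not hired".
classes = np.array(["hired", "not hired"])

# The hard prediction for each row is the class with the highest probability.
predictions = classes[proba.argmax(axis=1)]

# The model's confidence in each prediction is that highest probability.
confidence = proba.max(axis=1)

print(predictions.tolist())
print(confidence.round(3).tolist())
```

Reading the two printed lists side by side shows which applicants the model is most and least sure about.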

Question 2

['hired', 'hired', 'hired', 'not hired', 'not hired']

True or False: predict_proba

Applying predict_proba

Instructions:
Running a coding exercise for the first time can take a while as everything loads. Be patient; it could take a few minutes.

When you see ____ in a coding exercise, replace it with what you assume to be the correct code. Run it and see if you obtain the desired output. Submit your code to validate if you were correct.

Make sure you remove the hash (#) symbol in the coding portions of this question. We have commented those lines out so that they won’t execute and you can test your code after each step.

Let’s keep working with the Pokémon dataset. This time let’s do a bit more: let’s tune the hyperparameter C and see if we can find an example where the model is confident in its prediction.

Tasks:

  • Build and fit a pipeline containing the column transformer and a logistic regression model that uses the parameters class_weight="balanced" and max_iter=1000 (max_iter will stop a warning from occurring). Name this pipeline pkm_pipe.
  • Perform RandomizedSearchCV using the parameters specified in param_grid. Use n_iter equal to 10, 5 cross-validation folds and return the training score. Set random_state=2028 and set your scoring argument to f1. Name this object pmk_search.
  • Fit your pmk_search on the training data.
  • What is the best C value? Save it in an object named pkm_best_c.
  • What is the best f1 score? Save it in an object named pkm_best_score.
  • Find the predictions of the test set using predict. Save this in an object named predicted_y.
  • Find the target class probabilities of the test set using predict_proba. Save this in an object named proba_y.
  • Take the dataframe lr_probs and sort it in descending order of the model’s confidence in predicting legendary Pokémon. Save this in an object named legend_sorted.
Hint 1
  • Are you using make_pipeline(preprocessor, LogisticRegression(class_weight="balanced", max_iter=1000)) to build your pkm_pipe object?
  • In RandomizedSearchCV are you calling pkm_pipe and param_grid?
  • Are you specifying n_iter=10 and scoring = 'f1'?
  • Are you fitting pmk_search on your training data?
  • Are you using best_params_ to find the best C value?
  • Are you using best_score_ to find the best score?
  • For predicted_y, are you using pmk_search.predict(X_test)?
  • For proba_y are you using pmk_search.predict_proba(X_test)?
  • Are you sorting lr_probs by prob_legend and setting ascending = False?
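
The steps in the task list can be sketched end to end on stand-in data. This is not the course's solution: the Pokémon data, the real preprocessor and the real param_grid are not reproduced here, so the synthetic dataset from make_classification, the feature names f0 to f3, the StandardScaler-only column transformer, the loguniform range for C and the way lr_probs is built are all assumptions made for illustration. Only the object names and the search settings come from the tasks above.

```python
import numpy as np
import pandas as pd
from scipy.stats import loguniform
from sklearn.compose import make_column_transformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a synthetic, imbalanced binary dataset standing in for the Pokémon data
# (class 1 plays the role of "legendary").
feature_names = ["f0", "f1", "f2", "f3"]
X_arr, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                               n_redundant=1, weights=[0.9, 0.1], random_state=0)
X = pd.DataFrame(X_arr, columns=feature_names)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Assumption: a simple column transformer that scales every numeric feature.
preprocessor = make_column_transformer((StandardScaler(), feature_names))

# Build and fit the pipeline described in the first task.
pkm_pipe = make_pipeline(preprocessor,
                         LogisticRegression(class_weight="balanced", max_iter=1000))
pkm_pipe.fit(X_train, y_train)

# Assumption: a log-uniform search range for C; the exercise supplies its own param_grid.
param_grid = {"logisticregression__C": loguniform(0.01, 100)}

# Randomized search with the settings listed in the tasks.
pmk_search = RandomizedSearchCV(pkm_pipe, param_grid, n_iter=10, cv=5,
                                return_train_score=True, random_state=2028,
                                scoring="f1")
pmk_search.fit(X_train, y_train)

pkm_best_c = pmk_search.best_params_["logisticregression__C"]
pkm_best_score = pmk_search.best_score_

# Hard predictions and class probabilities for the test set.
predicted_y = pmk_search.predict(X_test)
proba_y = pmk_search.predict_proba(X_test)

# Assumption: lr_probs pairs each test example with the probability of class 1.
lr_probs = pd.DataFrame({"true_label": y_test,
                         "predicted_y": predicted_y,
                         "prob_legend": proba_y[:, 1]})
legend_sorted = lr_probs.sort_values(by="prob_legend", ascending=False)
print(legend_sorted.head())
```

The first rows of legend_sorted are the test examples the model is most confident belong to the positive class, and the last rows are the ones it is most confident do not.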
Fully worked solution: