3.1. Exercises

Transforming Columns with ColumnTransformer

Refer to the dataframe to answer the following question.

    colour  location  shape  water_content  weight
0      red    canada    NaN             84     100
1   yellow    mexico   long             75     120
2   orange     spain    NaN             90     NaN
3  magenta     china  round            NaN     600
4   purple   austria    NaN             80     115
5   purple    turkey   oval             78     340
6    green    mexico   oval             83     NaN
7     blue    canada  round             73     535
8    brown     china    NaN            NaN    1743
9   yellow    mexico   oval             83     265
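
To experiment with this dataframe, it can be rebuilt in pandas and passed through a ColumnTransformer. The sketch below is only an illustration: the variable names (df, numeric_cols, categorical_cols, ct) are assumptions, and the imputation strategies and encoder settings are borrowed from the coding exercise later in this section rather than prescribed for this question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Rebuild the dataframe shown above (NaN marks a missing value).
df = pd.DataFrame({
    "colour": ["red", "yellow", "orange", "magenta", "purple",
               "purple", "green", "blue", "brown", "yellow"],
    "location": ["canada", "mexico", "spain", "china", "austria",
                 "turkey", "mexico", "canada", "china", "mexico"],
    "shape": [np.nan, "long", np.nan, "round", np.nan,
              "oval", "oval", "round", np.nan, "oval"],
    "water_content": [84, 75, 90, np.nan, 80, 78, 83, 73, np.nan, 83],
    "weight": [100, 120, np.nan, 600, 115, 340, np.nan, 535, 1743, 265],
})

# Split the columns by dtype and count the missing values in each column.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
missing = df.isna().sum()
print(missing)

# Assumed preprocessing: median imputation + scaling for numeric columns,
# most-frequent imputation + one-hot encoding for categorical columns.
ct = ColumnTransformer(transformers=[
    ("numeric", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numeric_cols),
    ("categorical", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

transformed = ct.fit_transform(df)
# 2 scaled numeric columns + 8 colour + 6 location + 3 shape one-hot columns
print(transformed.shape)
```

Each categorical column expands into one column per distinct value, which is why the transformed data has more columns than the original dataframe.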


Transforming True or False

Your Turn with Column Transforming

Instructions:
Running a coding exercise for the first time can take a while as everything loads. Be patient; it could take a few minutes.

When you see ____ in a coding exercise, replace it with what you assume to be the correct code. Run it and see if you obtain the desired output. Submit your code to validate if you were correct.

Make sure you remove the hash (#) symbol in the coding portions of this question. We have commented these lines out so that they won't execute, letting you test your code after each step.

Let's now start applying transformations to our basketball dataset.

We've provided you with the numerical and categorical features; it's your turn to make a pipeline for each and then use ColumnTransformer to transform them.

We have a regression problem this time, where we are attempting to predict a player's salary.

Tasks:

  • Create a pipeline for the numeric features. It should have the first step as simple imputation using strategy="median" and the second step should be using StandardScaler. Name this pipeline numeric_transformer.
  • Create a pipeline for the categorical features. It should also have 2 steps. The first is imputation using strategy="most_frequent". The second step should be one-hot encoding with handle_unknown="ignore". Name this pipeline categorical_transformer.
  • Create a column transformer named col_transformer that applies the appropriate pipelines you built above to numeric_features and categorical_features.
  • Create a main pipeline named main_pipe which preprocesses with col_transformer followed by building a KNeighborsRegressor model.
  • The last step is performing cross-validation using our pipeline.
Hint 1
  • Are you using SimpleImputer(strategy="median") for numerical imputation?
  • Are you naming your steps?
  • Are you using SimpleImputer(strategy="most_frequent") for categorical imputation?
  • Are you using one-hot encoding?
  • Are you naming the steps in ColumnTransformer and specifying numeric_transformer with numeric_features and categorical_transformer with categorical_features?
  • Is the first step in your main pipeline calling col_transformer?
  • Are you calling main_pipe in cross_validate()?
Fully worked solution:
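
One way the tasks above might be completed is sketched below. The basketball dataset itself is not shown in this section, so the sketch builds a small synthetic stand-in: the dataframe name bball, the column names (height, weight, position, team, salary), the generated values, the step names, and the choice of cv=5 are all assumptions made for illustration. Only the transformer names and settings come from the task list.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the basketball data (not the real dataset).
rng = np.random.default_rng(0)
n = 60
bball = pd.DataFrame({
    "height": rng.normal(2.0, 0.1, n),
    "weight": rng.normal(100, 10, n),
    "position": rng.choice(["G", "F", "C"], n),
    "team": rng.choice(["ATL", "BOS", "CHI"], n),
    "salary": rng.normal(5e6, 1e6, n),
})
# Inject some missing values so that the imputers have work to do.
bball.loc[::7, "height"] = np.nan
bball.loc[::9, "position"] = np.nan

numeric_features = ["height", "weight"]          # assumed column names
categorical_features = ["position", "team"]      # assumed column names
X = bball.drop(columns=["salary"])
y = bball["salary"]

# Task 1: median imputation, then scaling.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Task 2: most-frequent imputation, then one-hot encoding.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Task 3: route each group of columns to its own pipeline.
col_transformer = ColumnTransformer(transformers=[
    ("numeric", numeric_transformer, numeric_features),
    ("categorical", categorical_transformer, categorical_features),
])

# Task 4: preprocessing followed by the regressor.
main_pipe = Pipeline(steps=[
    ("preprocessor", col_transformer),
    ("reg", KNeighborsRegressor()),
])

# Task 5: cross-validate the whole pipeline so that the imputers, scaler,
# and encoder are re-fitted on the training portion of each fold.
scores = cross_validate(main_pipe, X, y, cv=5, return_train_score=True)
print(pd.DataFrame(scores).mean())
```

Because the salary values here are random noise, the printed scores say nothing about real players; the point is only the structure of the pipelines and the call to cross_validate().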