5.1. Exercises
Transforming Categorical Features
Use the table below to answer the following questions.
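|   | colour  | tropical | location | carbs | seed  | shape | size  | water_content | weight |
|---|---------|----------|----------|-------|-------|-------|-------|---------------|--------|
| 0 | red     | False    | canada   | 6     | True  | round | small | 84            | 100    |
| 1 | yellow  | True     | mexico   | 12    | False | long  | med   | 75            | 120    |
| 2 | orange  | False    | china    | 8     | True  | round | large | 90            | 1360   |
| 3 | magenta | False    | china    | 18    | True  | round | small | 96            | 600    |
| 4 | purple  | False    | mexico   | 11    | False | round | small | 80            | 5      |
| 5 | purple  | False    | canada   | 8     | False | oval  | med   | 78            | 40     |
| 6 | green   | True     | mexico   | 14    | True  | oval  | med   | 83            | 450    |
| 7 | blue    | False    | canada   | 6     | True  | round | large | 73            | 5      |
| 8 | brown   | True     | china    | 8     | True  | round | large | 80            | 76     |
| 9 | yellow  | True     | mexico   | 4     | False | oval  | med   | 83            | 65     |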
Categorical True or False
Transforming the Fertility Dataset
Instructions:
Running a coding exercise for the first time could take a bit of time for everything to load. Be patient, it could take a few minutes.
When you see ____ in a coding exercise, replace it with what you assume to be the correct code. Run it and see if you obtain the desired output. Submit your code to validate if you were correct.
Make sure you remove the hash (#) symbol in the coding portions of this question. We have commented them so that the line won’t execute and you can test your code after each step.
For this question, we will be using a dataset from assignment 1.
Here is the requested citation: David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564–12573, 2012.
We will be making pipelines and transforming our features appropriately.
Firstly, let’s take a look at our dataset and the features.
Disclaimer: Normally we should be investing more time to fully understand the data we are analyzing. We should be checking the unique values, using .describe() and .info() to really get an idea of our features before deciding which transformations we want to apply.
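As a rough illustration of the checks mentioned in the disclaimer, the sketch below runs .info(), .describe() and a unique-value listing on a tiny made-up DataFrame. The rows are invented for demonstration and are not records from the dataset, and the column hours_sitting is a hypothetical numeric column added only so that .describe() has something to summarize.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT real records from the dataset.
df = pd.DataFrame({
    "smoking_habit": ["never", "daily", "occasional", "never"],
    "high_fevers_last_year": ["no", "more than 3 months ago", "no",
                              "less than 3 months ago"],
    "hours_sitting": [8, 6, 10, 7],  # hypothetical numeric column
})

df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics for the numeric columns

# List the distinct values of every non-numeric column.
non_numeric = df.select_dtypes(exclude="number")
for col in non_numeric:
    print(col, sorted(non_numeric[col].unique()))
```

Seeing the distinct values of each column is usually what decides whether a feature should be treated as binary, ordinal or plain categorical.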
Secondly, let’s split the numeric and categorical features.
Tasks:
- What are the numeric features? Add them to a list named numeric_features.
- What are the binary features? Add them to a list named binary_features.
- What are the ordinal features? Add them to a list named ordinal_features.
- What are the rest of the categorical features? Add them to a list named categorical_features.
- Order the values in high_fevers_last_year and name the list fever_order. The options are ‘more than 3 months ago’, ‘less than 3 months ago’ and ‘no’.
- Order the values in smoking_habit and name the list smoking_order. The options are ‘occasional’, ‘daily’ and ‘never’.
- Order the values in freq_alcohol_con and name the list alcohol_order. The options are ‘once a week’, ‘hardly ever or never’, ‘several times a week’, ‘several times a day’ and ‘every day’.
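The ordering tasks above could be sketched as follows. The orderings shown are one plausible reading (from least to most frequent or recent) and are an assumption, not the official solution; in particular, where ‘several times a day’ sits relative to ‘every day’ is a judgment call. The toy DataFrame is invented to show how an ordered list feeds into scikit-learn's OrdinalEncoder.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed orderings, lowest to highest; other readings are defensible.
fever_order = ["no", "more than 3 months ago", "less than 3 months ago"]
smoking_order = ["never", "occasional", "daily"]
alcohol_order = [
    "hardly ever or never",
    "once a week",
    "several times a week",
    "every day",
    "several times a day",
]

# Toy column (made-up values) to show the effect of the ordering.
toy = pd.DataFrame({"smoking_habit": ["daily", "never", "occasional"]})

# `categories` takes one list per encoded column, hence the outer list.
encoder = OrdinalEncoder(categories=[smoking_order], dtype=int)
encoded = encoder.fit_transform(toy)
print(encoded.ravel().tolist())  # [2, 0, 1]
```

Each value is replaced by its position in the supplied list, so the list order is what gives the encoded integers their meaning.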
Now, we are ready to make the pipelines and transform our features.
Tasks:
- There are several pipelines already made for you. Designate numeric_transformer to the numerical transformer, categorical_transformer to the transformer that is not transforming binary or ordinal features, binary_transformer to the transformer of binary features, and ordinal_transformer1, ordinal_transformer2 and ordinal_transformer3 to the transformer of columns high_fevers_last_year, smoking_habit and freq_alcohol_con respectively.
- Fill in the associated gaps in the column transformer named preprocessor.
- Build a main pipeline using KNeighborsClassifier and name the object main_pipe.
- Cross-validate and see the results.
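The overall shape of these steps might look like the sketch below. It is a minimal illustration on generated toy data, not the exercise solution: the columns age, season and had_surgery, the target labels, and the choice of imputer and scaler are all assumptions, and only one ordinal transformer is shown where the exercise uses three.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Deterministic toy data; every column except smoking_habit is hypothetical.
n = 40
seasons = ["winter", "spring", "summer", "fall"]
smoking_order = ["never", "occasional", "daily"]  # assumed ordering
X_train = pd.DataFrame({
    "age": [18 + (i * 7) % 19 for i in range(n)],
    "season": [seasons[i % 4] for i in range(n)],
    "had_surgery": [["yes", "no"][(i // 2) % 2] for i in range(n)],
    "smoking_habit": [smoking_order[i % 3] for i in range(n)],
})
y_train = pd.Series([["normal", "altered"][i % 2] for i in range(n)])

numeric_features = ["age"]
categorical_features = ["season"]
binary_features = ["had_surgery"]
ordinal_features = ["smoking_habit"]

# One transformer per feature type.
numeric_transformer = make_pipeline(SimpleImputer(strategy="median"),
                                    StandardScaler())
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
binary_transformer = OneHotEncoder(drop="if_binary", dtype=int)
ordinal_transformer = OrdinalEncoder(categories=[smoking_order], dtype=int)

# Route each group of columns to its transformer.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
        ("bin", binary_transformer, binary_features),
        ("ord", ordinal_transformer, ordinal_features),
    ]
)

# Preprocessing and the model live in one pipeline, so each
# cross-validation fold fits the transformers on its training split only.
main_pipe = make_pipeline(preprocessor, KNeighborsClassifier())
scores = cross_validate(main_pipe, X_train, y_train, cv=5,
                        return_train_score=True)
print(pd.DataFrame(scores))
```

Because the toy labels are arbitrary, the scores themselves carry no meaning here; the point is only how the transformers, the column transformer and the classifier fit together.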