5.1. Exercises

Transforming Categorical Features

Use the table below to answer the following questions.

    colour  tropical  location  carbs   seed  shape   size  water_content  weight
0      red     False    canada      6   True  round  small             84     100
1   yellow      True    mexico     12  False   long    med             75     120
2   orange     False     china      8   True  round  large             90    1360
3  magenta     False     china     18   True  round  small             96     600
4   purple     False    mexico     11  False  round  small             80       5
5   purple     False    canada      8  False   oval    med             78      40
6    green      True    mexico     14   True   oval    med             83     450
7     blue     False    canada      6   True  round  large             73       5
8    brown      True     china      8   True  round  large             80      76
9   yellow      True    mexico      4  False   oval    med             83      65


Categorical True or False

Transforming the Fertility Dataset

Instructions:
Running a coding exercise for the first time can take a while because everything has to load. Be patient; it could take a few minutes.

When you see ____ in a coding exercise, replace it with what you assume to be the correct code. Run it and see if you obtain the desired output. Submit your code to validate if you were correct.

Make sure you remove the hash (#) symbol in the coding portions of this question. We have commented these lines out so that they won’t execute and you can test your code after each step.

For this question, we will be using a dataset from assignment 1.

Here is the requested citation: David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564–12573, 2012.

We will be making pipelines and transforming our features appropriately.

Firstly, let’s take a look at our dataset and the features.

Disclaimer: Normally we would invest more time in understanding the data we are analyzing. That means checking the unique values of each column and using .describe() and .info() to get a clear picture of the features before deciding which transformations to apply.
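As a rough illustration of the checks mentioned in the disclaimer, here is a minimal sketch using pandas. The three-row sample_df is a hypothetical stand-in built from the option values listed in this exercise; it is not the real dataset, and the name unique_values is our own.

```python
import pandas as pd

# Hypothetical three-row sample; the real exercise uses the full fertility dataset.
sample_df = pd.DataFrame(
    {
        "high_fevers_last_year": ["no", "more than 3 months ago", "less than 3 months ago"],
        "smoking_habit": ["never", "daily", "occasional"],
        "freq_alcohol_con": ["once a week", "hardly ever or never", "every day"],
    }
)

# Column dtypes and non-null counts
sample_df.info()

# Summary statistics; include="all" also summarizes non-numeric columns
print(sample_df.describe(include="all"))

# Distinct values in each column, sorted alphabetically
unique_values = {col: sorted(sample_df[col].unique()) for col in sample_df.columns}
print(unique_values)
```

Looking at the distinct values of each column is usually what tells us whether a feature is binary, ordinal or simply categorical.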


Secondly, let’s split the numeric and categorical features.

Tasks:

  • What are the numeric features? Add them to a list named numeric_features.
  • What are the binary features? Add them to a list named binary_features.
  • What are the ordinal features? Add them to a list named ordinal_features.
  • What are the rest of the categorical features? Add them to a list named categorical_features.
  • Order the values in high_fevers_last_year and name the list fever_order. The options are ‘more than 3 months ago’, ‘less than 3 months ago’ and ‘no’.
  • Order the values in smoking_habit and name the list smoking_order. The options are ‘occasional’, ‘daily’ and ‘never’.
  • Order the values in freq_alcohol_con and name the list alcohol_order. The options are ‘once a week’, ‘hardly ever or never’, ‘several times a week’, ‘several times a day’ and ‘every day’.
Hint 1
  • Are you ordering the ordinal values correctly?
  • Do you have 3 binary features?
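To see how an ordering list like fever_order is typically used, here is a small sketch with scikit-learn's OrdinalEncoder. So that it does not give away the answers above, it uses the size column of the fruit table from the first exercise instead of the fertility data. The names fruit_sizes and size_order are our own.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The size column of the fruit table, rows 0 to 9
fruit_sizes = pd.DataFrame(
    {"size": ["small", "med", "large", "small", "small", "med", "med", "large", "large", "med"]}
)

# Hypothetical list name; values run from smallest to largest
size_order = ["small", "med", "large"]

# categories expects one list per encoded column, hence the outer brackets
encoder = OrdinalEncoder(categories=[size_order], dtype=int)
encoded = encoder.fit_transform(fruit_sizes[["size"]])

print(encoded.ravel())  # small -> 0, med -> 1, large -> 2
```

Without an explicit categories argument, the encoder would sort the values alphabetically (large, med, small), which is why the order of the list matters.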


Now, we are ready to make the pipelines and transform our features.

Tasks:

  • Several pipelines have already been made for you. Name each one according to what it transforms: numeric_transformer for the numeric features, binary_transformer for the binary features, categorical_transformer for the categorical features that are neither binary nor ordinal, and ordinal_transformer1, ordinal_transformer2 and ordinal_transformer3 for the columns high_fevers_last_year, smoking_habit and freq_alcohol_con respectively.
  • Fill in the associated gaps in the column transformer named preprocessor.
  • Build a main pipeline using KNeighborsClassifier and name the object main_pipe.
  • Cross-validate and see the results.
Hint 1
  • Are you using the features you categorized above?
  • Are you naming the pipelines correctly?
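For orientation, the overall shape of such a setup can be sketched on synthetic data. This is not the solution to the exercise: the data is random, every column name and value is hypothetical, and there is only one ordinal transformer. The particular scaler and encoders are our own choices and may differ from the pipelines provided in the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic toy data; all column names and values are hypothetical
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame(
    {
        "hours": rng.normal(8, 2, n),
        "is_member": rng.choice(["yes", "no"], n),
        "level": rng.choice(["low", "mid", "high"], n),
        "city": rng.choice(["a", "b", "c"], n),
    }
)
y = rng.choice(["pos", "neg"], n)  # random labels, so scores carry no meaning

numeric_features = ["hours"]
binary_features = ["is_member"]
ordinal_features = ["level"]
categorical_features = ["city"]
level_order = ["low", "mid", "high"]

# One small pipeline per feature type
numeric_transformer = make_pipeline(StandardScaler())
binary_transformer = make_pipeline(OneHotEncoder(drop="if_binary", dtype=int))
ordinal_transformer = make_pipeline(OrdinalEncoder(categories=[level_order], dtype=int))
categorical_transformer = make_pipeline(OneHotEncoder(handle_unknown="ignore"))

# Route each group of columns to its transformer
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (binary_transformer, binary_features),
    (ordinal_transformer, ordinal_features),
    (categorical_transformer, categorical_features),
)

# Preprocessing and model in one pipeline, so each CV fold is transformed separately
main_pipe = make_pipeline(preprocessor, KNeighborsClassifier())

scores = cross_validate(main_pipe, X, y, cv=5, return_train_score=True)
print(pd.DataFrame(scores).mean())
```

Because the preprocessor sits inside main_pipe, the scaler and encoders are fit only on the training portion of each fold, so no information from the validation portion reaches the transformations.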