Module Learning Outcomes

By the end of the module, students are expected to:

Explain the handle_unknown="ignore" hyperparameter of scikit-learn’s OneHotEncoder.
Identify when it’s appropriate to apply ordinal encoding vs one-hot encoding.
Explain strategies to deal with categorical variables with too many categories.
Explain why text data needs a different treatment than categorical variables.
Use scikit-learn’s CountVectorizer to encode text data.
Explain the different hyperparameters of CountVectorizer.
Use ColumnTransformer to combine all our transformations into one object and use it with scikit-learn pipelines.
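
As a preview of where the module is heading, here is a minimal sketch of how the pieces named above can fit together. The toy DataFrames and the column names (colour, size, review) are hypothetical, made up for illustration. Setting sparse_threshold=0 is only a convenience so that the result is a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one nominal column, one ordered column, one text column.
train = pd.DataFrame({
    "colour": ["red", "blue", "red"],
    "size": ["small", "large", "medium"],
    "review": ["great fit", "poor fit", "great colour"],
})
test = pd.DataFrame({
    "colour": ["green"],                 # category never seen during fit
    "size": ["medium"],
    "review": ["great fit great price"],  # "price" never seen during fit
})

preprocessor = ColumnTransformer(
    [
        # Unseen categories become all zeros instead of raising an error.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["colour"]),
        # The category order is stated explicitly because it is meaningful.
        ("ordinal", OrdinalEncoder(categories=[["small", "medium", "large"]]), ["size"]),
        # CountVectorizer expects a 1-D column, so the name is a string, not a list.
        ("text", CountVectorizer(), "review"),
    ],
    sparse_threshold=0,  # always return a dense array (convenience for this sketch)
)

X_train = preprocessor.fit_transform(train)
X_test = preprocessor.transform(test)

print(X_train.shape)
print(X_test)
```

In this sketch the unseen colour "green" is encoded as zeros in every one-hot column, and the unseen word "price" is simply not counted. An object like this preprocessor can then be used as the first step of a scikit-learn pipeline, ahead of a model.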

Let’s start!