Introduction to Machine Learning – Categorical Variables: Ordinal Encoding

Remember our case study with the California housing dataset?

train_df.head()

	longitude	latitude	housing_median_age	households	...	ocean_proximity	rooms_per_household	bedrooms_per_household	population_per_household
6051	-117.75	34.04	22.0	602.0	...	INLAND	4.897010	1.056478	4.318937
20113	-119.57	37.94	17.0	20.0	...	INLAND	17.300000	6.500000	2.550000
14289	-117.13	32.74	46.0	708.0	...	NEAR OCEAN	4.738701	1.084746	2.057910
13665	-117.31	34.02	18.0	285.0	...	INLAND	5.733333	0.961404	3.154386
14471	-117.23	32.88	18.0	1458.0	...	NEAR OCEAN	3.817558	1.004801	4.323045

5 rows × 10 columns

X_train = train_df.drop(columns=["median_house_value"])
y_train = train_df["median_house_value"]

X_test = test_df.drop(columns=["median_house_value"])
y_test = test_df["median_house_value"]

from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute missing values, scale the features, then fit a k-NN regressor.
pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("reg", KNeighborsRegressor()),
    ]
)

pipe.fit(X_train, y_train)

ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: 'INLAND'

Detailed traceback: 
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.12/site-packages/sklearn/base.py", line 1363, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sklearn/pipeline.py", line 653, in fit
    Xt = self._fit(X, y, routed_params, raw_params=params)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sklearn/pipeline.py", line 587, in _fit
    X, fitted_transformer = fit_transform_one_cached(

So what do we do?

- Drop the column (not recommended).
- Transform the categorical features into numeric ones so that we can use them in the model.

There are two transformations we can apply:
- Ordinal encoding
- One-hot encoding (recommended in most cases)
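To make the idea concrete, here is a minimal sketch of finding the string columns that trip up the imputer and swapping them for integer codes. The `toy_housing` frame is made up for illustration and only mimics two columns of the housing data; it is not the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows that only mimic the shape of the housing data.
toy_housing = pd.DataFrame(
    {
        "housing_median_age": [22.0, 17.0, 46.0, 18.0],
        "ocean_proximity": ["INLAND", "INLAND", "NEAR OCEAN", "INLAND"],
    }
)

# Non-numeric columns are the ones a median imputer cannot handle.
categorical_cols = toy_housing.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)  # ['ocean_proximity']

# Encode those columns as integers and put them back next to the numeric ones.
codes = pd.DataFrame(
    OrdinalEncoder(dtype=int).fit_transform(toy_housing[categorical_cols]),
    columns=categorical_cols,
    index=toy_housing.index,
)
encoded = pd.concat([toy_housing.drop(columns=categorical_cols), codes], axis=1)
print(encoded)
```

After this step every column is numeric, so a numeric-only pipeline would no longer raise the `ValueError`. Whether integer codes are a sensible representation for a given column is a separate question.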

Ordinal encoding

X_toy

	rating
0	Good
1	Bad
2	Good
3	Good
4	Bad
5	Neutral
6	Good
7	Good
8	Neutral
9	Neutral
10	Neutral
11	Good
12	Bad
13	Good

X_toy['rating'].value_counts().to_frame().T

rating	Good	Neutral	Bad
count	7	4	3

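from sklearn.preprocessing import OrdinalEncoder

# With no `categories` argument, the encoder assigns integer codes
# to the categories in sorted (alphabetical) order.
oe = OrdinalEncoder(dtype=int)
oe.fit(X_toy);
X_toy_ord = oe.transform(X_toy)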

X_toy_ord

array([[1],
       [0],
       [1],
       [1],
       [0],
       [2],
       [1],
       [1],
       [2],
       [2],
       [2],
       [1],
       [0],
       [1]])

encoding_view = X_toy.assign(rating_enc=X_toy_ord)
encoding_view

	rating	rating_enc
0	Good	1
1	Bad	0
2	Good	1
3	Good	1
4	Bad	0
5	Neutral	2
6	Good	1
7	Good	1
8	Neutral	2
9	Neutral	2
10	Neutral	2
11	Good	1
12	Bad	0
13	Good	1
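Why does Good get 1 and Neutral get 2? A quick way to check is to look at the fitted encoder's `categories_` attribute. The sketch below uses a shortened stand-in for the rating column (`X_demo` and `oe_demo` are names introduced here, not part of the original example).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# A shortened stand-in for the rating column used above.
X_demo = pd.DataFrame({"rating": ["Good", "Bad", "Neutral", "Good"]})

oe_demo = OrdinalEncoder(dtype=int)
oe_demo.fit(X_demo)

# categories_ holds one array per column; a category's position is its code.
learned = oe_demo.categories_[0].tolist()
print(learned)  # ['Bad', 'Good', 'Neutral']

code_for = {category: code for code, category in enumerate(learned)}
print(code_for)  # {'Bad': 0, 'Good': 1, 'Neutral': 2}
```

The categories are sorted alphabetically, so the codes say nothing about which rating is better.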

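# Spell out the order we want, from lowest to highest rating.
ratings_order = ['Bad', 'Neutral', 'Good']

# `categories` takes one list per column, hence the nested list.
oe = OrdinalEncoder(categories=[ratings_order], dtype=int)
oe.fit(X_toy);
X_toy_ord = oe.transform(X_toy)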

X_toy_ord

array([[2],
       [0],
       [2],
       [2],
       [0],
       [1],
       [2],
       [2],
       [1],
       [1],
       [1],
       [2],
       [0],
       [2]])

encoding_view = X_toy.assign(rating_enc=X_toy_ord)
encoding_view

	rating	rating_enc
0	Good	2
1	Bad	0
2	Good	2
3	Good	2
4	Bad	0
5	Neutral	1
6	Good	2
7	Good	2
8	Neutral	1
9	Neutral	1
10	Neutral	1
11	Good	2
12	Bad	0
13	Good	2
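Two follow-up questions often come up once the order is fixed: what happens to a rating the encoder never saw, and how do we get the labels back? The sketch below is one possible setup, not the only one. The value "Excellent" and the choice of -1 as the code for unknown values are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

ratings_order = ["Bad", "Neutral", "Good"]
X_fit = pd.DataFrame({"rating": ["Good", "Bad", "Neutral"]})

oe_demo = OrdinalEncoder(
    categories=[ratings_order],
    dtype=int,
    handle_unknown="use_encoded_value",  # the default, "error", raises instead
    unknown_value=-1,  # assumed placeholder code for unseen categories
)
oe_demo.fit(X_fit)

# "Excellent" is a made-up rating that was not in the fitted categories.
X_new = pd.DataFrame({"rating": ["Neutral", "Excellent"]})
codes = oe_demo.transform(X_new)
print(codes.ravel().tolist())  # [1, -1]

# inverse_transform maps codes back to the original labels.
labels = oe_demo.inverse_transform([[0], [2]])
print(labels.ravel().tolist())  # ['Bad', 'Good']
```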

X_toy

	language
0	English
1	Vietnamese
2	English
3	Mandarin
4	English
5	English
6	Mandarin
7	English
8	Vietnamese
9	Mandarin
10	French
11	Spanish
12	Mandarin
13	Hindi

X_toy['language'].value_counts().to_frame().T

language	English	Mandarin	Vietnamese	French	Spanish	Hindi
count	5	4	2	1	1	1

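from sklearn.preprocessing import OrdinalEncoder

# The languages have no natural order, so the encoder falls back to
# alphabetical order and the resulting codes are arbitrary.
oe = OrdinalEncoder(dtype=int)
oe.fit(X_toy);
X_toy_ord = oe.transform(X_toy)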

encoding_view = X_toy.assign(language_enc=X_toy_ord)
encoding_view

	language	language_enc
0	English	0
1	Vietnamese	5
2	English	0
3	Mandarin	3
4	English	0
5	English	0
6	Mandarin	3
7	English	0
8	Vietnamese	5
9	Mandarin	3
10	French	1
11	Spanish	4
12	Mandarin	3
13	Hindi	2
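One way to read the table above: the codes put English next to French and far from Vietnamese, purely because of spelling. A model that works with distances between feature values, such as the k-NN regressor from the housing pipeline, would likely treat those gaps as meaningful. The sketch below just measures the gaps; `X_demo` and the variable names are introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

languages = ["English", "French", "Hindi", "Mandarin", "Spanish", "Vietnamese"]
X_demo = pd.DataFrame({"language": languages})

# Default behaviour: codes follow alphabetical order.
codes = OrdinalEncoder(dtype=int).fit_transform(X_demo).ravel()
code_for = dict(zip(languages, codes.tolist()))

# Gaps between codes reflect spelling, not any real similarity.
gap_french = abs(code_for["English"] - code_for["French"])
gap_vietnamese = abs(code_for["English"] - code_for["Vietnamese"])
print(gap_french, gap_vietnamese)  # 1 5
```

This is a hint at why ordinal encoding suits ordered categories like ratings better than unordered ones like languages.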

Let’s apply what we learned!