Categorical Variables: Ordinal Encoding

Remember our case study with the California housing dataset?

train_df.head()
       longitude  latitude  housing_median_age  households  ...  ocean_proximity  rooms_per_household  bedrooms_per_household  population_per_household
6051     -117.75     34.04                22.0       602.0  ...           INLAND             4.897010                1.056478                  4.318937
20113    -119.57     37.94                17.0        20.0  ...           INLAND            17.300000                6.500000                  2.550000
14289    -117.13     32.74                46.0       708.0  ...       NEAR OCEAN             4.738701                1.084746                  2.057910
13665    -117.31     34.02                18.0       285.0  ...           INLAND             5.733333                0.961404                  3.154386
14471    -117.23     32.88                18.0      1458.0  ...       NEAR OCEAN             3.817558                1.004801                  4.323045

5 rows × 10 columns


X_train = train_df.drop(columns=["median_house_value"])
y_train = train_df["median_house_value"]

X_test = test_df.drop(columns=["median_house_value"])
y_test = test_df["median_house_value"]
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("reg", KNeighborsRegressor()),
    ]
)
pipe.fit(X_train, y_train)
ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: 'INLAND'

Detailed traceback: 
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.12/site-packages/sklearn/base.py", line 1363, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sklearn/pipeline.py", line 653, in fit
    Xt = self._fit(X, y, routed_params, raw_params=params)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sklearn/pipeline.py", line 587, in _fit
    X, fitted_transformer = fit_transform_one_cached(
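
The error comes from a column that holds strings rather than numbers. One quick way to find such columns is `select_dtypes`. The sketch below uses a tiny made-up frame (`toy_df` and its values are illustrative, not the real `train_df`):

```python
import pandas as pd

# Hypothetical miniature of the housing data: one numeric, one string column.
toy_df = pd.DataFrame(
    {
        "households": [602.0, 20.0, 708.0],
        "ocean_proximity": ["INLAND", "INLAND", "NEAR OCEAN"],
    }
)

# Columns that are not numeric are the ones a median imputer cannot handle.
non_numeric_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```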


So what do we do?

  • Drop the column (not recommended, since it throws away information)
  • Transform the categorical features into numeric ones so that the model can use them
  • There are two transformations we can use for this; the first is ordinal encoding

Ordinal encoding

X_toy
rating
0 Good
1 Bad
2 Good
3 Good
4 Bad
5 Neutral
6 Good
7 Good
8 Neutral
9 Neutral
10 Neutral
11 Good
12 Bad
13 Good
X_toy['rating'].value_counts().to_frame().T
rating Good Neutral Bad
count 7 4 3
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(dtype=int)
oe.fit(X_toy);
X_toy_ord = oe.transform(X_toy)

X_toy_ord
array([[1],
       [0],
       [1],
       [1],
       [0],
       [2],
       [1],
       [1],
       [2],
       [2],
       [2],
       [1],
       [0],
       [1]])
encoding_view = X_toy.assign(rating_enc=X_toy_ord)
encoding_view
rating rating_enc
0 Good 1
1 Bad 0
2 Good 1
3 Good 1
4 Bad 0
5 Neutral 2
6 Good 1
7 Good 1
8 Neutral 2
9 Neutral 2
10 Neutral 2
11 Good 1
12 Bad 0
13 Good 1
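
Without a `categories` argument, `OrdinalEncoder` sorts the categories it finds and numbers them in that order, which for strings means alphabetical. That is why `Bad` became 0, `Good` became 1 and `Neutral` became 2 above, even though that is not the natural order of the ratings. A minimal sketch for checking the learned order through the fitted `categories_` attribute (the small `ratings` frame is made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data, not the X_toy frame from the slides.
ratings = pd.DataFrame({"rating": ["Good", "Bad", "Neutral", "Good"]})

oe_default = OrdinalEncoder(dtype=int)
oe_default.fit(ratings)

# categories_ holds one array per column; position in the array is the code.
learned_order = oe_default.categories_[0].tolist()
mapping = {category: code for code, category in enumerate(learned_order)}
print(mapping)
```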
ratings_order = ['Bad', 'Neutral', 'Good']


oe = OrdinalEncoder(categories = [ratings_order], dtype=int)
oe.fit(X_toy);
X_toy_ord = oe.transform(X_toy)

X_toy_ord
array([[2],
       [0],
       [2],
       [2],
       [0],
       [1],
       [2],
       [2],
       [1],
       [1],
       [1],
       [2],
       [0],
       [2]])
encoding_view = X_toy.assign(rating_enc=X_toy_ord)
encoding_view
rating rating_enc
0 Good 2
1 Bad 0
2 Good 2
3 Good 2
4 Bad 0
5 Neutral 1
6 Good 2
7 Good 2
8 Neutral 1
9 Neutral 1
10 Neutral 1
11 Good 2
12 Bad 0
13 Good 2
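
Passing the order explicitly makes the codes follow the meaning of the ratings: `Bad` < `Neutral` < `Good`. As a sanity check, the codes can be mapped back to labels with `inverse_transform`. A small sketch, again on made-up data rather than the `X_toy` frame above:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data only.
ratings = pd.DataFrame({"rating": ["Good", "Bad", "Neutral"]})

# One list of categories per column, given from lowest to highest.
oe_ordered = OrdinalEncoder(categories=[["Bad", "Neutral", "Good"]], dtype=int)
codes = oe_ordered.fit_transform(ratings)

# Round trip: the codes decode back to the original labels.
decoded = oe_ordered.inverse_transform(codes)
print(codes.ravel().tolist())
print(decoded.ravel().tolist())
```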
X_toy
language
0 English
1 Vietnamese
2 English
3 Mandarin
4 English
5 English
6 Mandarin
7 English
8 Vietnamese
9 Mandarin
10 French
11 Spanish
12 Mandarin
13 Hindi
X_toy['language'].value_counts().to_frame().T
language English Mandarin Vietnamese French Spanish Hindi
count 5 4 2 1 1 1
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(dtype=int)
oe.fit(X_toy);
X_toy_ord = oe.transform(X_toy)

encoding_view = X_toy.assign(language_enc=X_toy_ord)
encoding_view
language language_enc
0 English 0
1 Vietnamese 5
2 English 0
3 Mandarin 3
4 English 0
5 English 0
6 Mandarin 3
7 English 0
8 Vietnamese 5
9 Mandarin 3
10 French 1
11 Spanish 4
12 Mandarin 3
13 Hindi 2
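
Unlike ratings, languages have no natural order, so the integer codes here only reflect alphabetical sorting. A model that reads the codes as numbers would treat English as "closer" to French than to Vietnamese, which means nothing. A small sketch of those gaps on an illustrative frame (the variable names are made up for this example):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data: one row per language from the example above.
languages = pd.DataFrame(
    {"language": ["English", "Vietnamese", "Mandarin", "French", "Spanish", "Hindi"]}
)

oe = OrdinalEncoder(dtype=int)
codes = oe.fit_transform(languages).ravel()
code_of = {lang: int(code) for lang, code in zip(languages["language"], codes)}

# These numeric gaps come only from alphabetical order, not from the data.
gap_en_fr = abs(code_of["English"] - code_of["French"])
gap_en_vi = abs(code_of["English"] - code_of["Vietnamese"])
print(gap_en_fr, gap_en_vi)
```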

Let’s apply what we learned!