One-hot encoding

From before …

encoding_view
language language_enc
0 English 0
1 Vietnamese 5
2 English 0
3 Mandarin 3
4 English 0
5 English 0
6 Mandarin 3
7 English 0
8 Vietnamese 5
9 Mandarin 3
10 French 1
11 Spanish 4
12 Mandarin 3
13 Hindi 2

What's wrong with this?

oe.categories_
[array(['English', 'French', 'Hindi', 'Mandarin', 'Spanish', 'Vietnamese'], dtype=object)]


encoding_view.drop_duplicates()
language language_enc
0 English 0
1 Vietnamese 5
3 Mandarin 3
10 French 1
11 Spanish 4
13 Hindi 2
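One way to see the problem is to look at the gaps between the integer codes. The sketch below rebuilds the table above, assuming `oe` is a scikit-learn `OrdinalEncoder` fitted on the `language` column (the slide does not show how `oe` was created, so that part is an assumption). The codes simply follow alphabetical order, yet a model that treats them as numbers would read English as "closer" to French than to Vietnamese.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data copied from the table above.
X_toy = pd.DataFrame(
    {
        "language": [
            "English", "Vietnamese", "English", "Mandarin", "English",
            "English", "Mandarin", "English", "Vietnamese", "Mandarin",
            "French", "Spanish", "Mandarin", "Hindi",
        ]
    }
)

# Assumption: the integer codes came from an OrdinalEncoder like this one.
oe = OrdinalEncoder(dtype=int)
codes = oe.fit_transform(X_toy).ravel()
encoding_view = X_toy.assign(language_enc=codes)

# Map each category to its integer code (alphabetical position).
code_of = {cat: i for i, cat in enumerate(oe.categories_[0])}

# The gaps between codes carry no real meaning, but a model
# that treats the column as numeric will still use them.
gap_en_fr = abs(code_of["English"] - code_of["French"])
gap_en_vi = abs(code_of["English"] - code_of["Vietnamese"])
print(encoding_view.drop_duplicates())
print(gap_en_fr, gap_en_vi)
```

Languages have no natural order, so these distances are an artifact of the encoding rather than information about the data.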

One-hot encoding (OHE)

Ordinal encoding:

encoding_view[['language_enc']].head()
language_enc
0 0
1 5
2 0
3 3
4 0


One-hot encoding:

one_hot_df.head()
language_English language_French language_Hindi language_Mandarin language_Spanish language_Vietnamese
0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 1.0
2 1.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 1.0 0.0 0.0
4 1.0 0.0 0.0 0.0 0.0 0.0

How to one-hot encode

X_toy
language
0 English
1 Vietnamese
2 English
3 Mandarin
4 English
5 English
6 Mandarin
7 English
8 Vietnamese
9 Mandarin
10 French
11 Spanish
12 Mandarin
13 Hindi

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, dtype='int')
ohe.fit(X_toy);
X_toy_ohe = ohe.transform(X_toy)

X_toy_ohe
array([[1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0]])
pd.DataFrame(
    data=X_toy_ohe,
    columns=ohe.get_feature_names_out(['language']),
    index=X_toy.index,
)
language_English language_French language_Hindi language_Mandarin language_Spanish language_Vietnamese
0 1 0 0 0 0 0
1 0 0 0 0 0 1
2 1 0 0 0 0 0
3 0 0 0 1 0 0
4 1 0 0 0 0 0
5 1 0 0 0 0 0
6 0 0 0 1 0 0
7 1 0 0 0 0 0
8 0 0 0 0 0 1
9 0 0 0 1 0 0
10 0 1 0 0 0 0
11 0 0 0 0 1 0
12 0 0 0 1 0 0
13 0 0 1 0 0 0
X_train.head()
longitude latitude housing_median_age households ... ocean_proximity rooms_per_household bedrooms_per_household population_per_household
6051 -117.75 34.04 22.0 602.0 ... INLAND 4.897010 1.056478 4.318937
20113 -119.57 37.94 17.0 20.0 ... INLAND 17.300000 6.500000 2.550000
14289 -117.13 32.74 46.0 708.0 ... NEAR OCEAN 4.738701 1.084746 2.057910
13665 -117.31 34.02 18.0 285.0 ... INLAND 5.733333 0.961404 3.154386
14471 -117.23 32.88 18.0 1458.0 ... NEAR OCEAN 3.817558 1.004801 4.323045

5 rows × 9 columns


X_train['ocean_proximity'].unique()
array(['INLAND', 'NEAR OCEAN', '<1H OCEAN', 'NEAR BAY', 'ISLAND'], dtype=object)

One-hot encoding the California housing data

ohe = OneHotEncoder(sparse_output=False, dtype="int")
ohe.fit(X_train[["ocean_proximity"]])
X_imp_ohe_train = ohe.transform(X_train[["ocean_proximity"]])

X_imp_ohe_train
array([[0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1],
       ...,
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0]], shape=(18576, 5))
transformed_ohe = pd.DataFrame(
    data=X_imp_ohe_train,
    columns=ohe.get_feature_names_out(['ocean_proximity']),
    index=X_train.index,
)

transformed_ohe.head()
ocean_proximity_<1H OCEAN ocean_proximity_INLAND ocean_proximity_ISLAND ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN
6051 0 1 0 0 0
20113 0 1 0 0 0
14289 0 0 0 0 1
13665 0 1 0 0 0
14471 0 0 0 0 1

Let’s apply what we learned!