Preprocessing with Imputation

Case study: California housing prices

import pandas as pd
from sklearn.model_selection import train_test_split

housing_df = pd.read_csv("data/housing.csv")
train_df, test_df = train_test_split(housing_df, test_size=0.1, random_state=123)

train_df.head()
longitude latitude housing_median_age total_rooms ... households median_income median_house_value ocean_proximity
6051 -117.75 34.04 22.0 2948.0 ... 602.0 3.1250 113600.0 INLAND
20113 -119.57 37.94 17.0 346.0 ... 20.0 3.4861 137500.0 INLAND
14289 -117.13 32.74 46.0 3355.0 ... 708.0 2.6604 170100.0 NEAR OCEAN
13665 -117.31 34.02 18.0 1634.0 ... 285.0 5.2139 129300.0 INLAND
14471 -117.23 32.88 18.0 5566.0 ... 1458.0 1.8580 205000.0 NEAR OCEAN

5 rows × 10 columns

We are using the data that can be downloaded here.

This dataset is a modified version of the California Housing dataset available from: Luís Torgo’s University of Porto website.

train_df = train_df.assign(rooms_per_household = train_df["total_rooms"]/train_df["households"],
                           bedrooms_per_household = train_df["total_bedrooms"]/train_df["households"],
                           population_per_household = train_df["population"]/train_df["households"])
                        
test_df = test_df.assign(rooms_per_household = test_df["total_rooms"]/test_df["households"],
                         bedrooms_per_household = test_df["total_bedrooms"]/test_df["households"],
                         population_per_household = test_df["population"]/test_df["households"])
                         
train_df = train_df.drop(columns=['total_rooms', 'total_bedrooms', 'population'])  
test_df = test_df.drop(columns=['total_rooms', 'total_bedrooms', 'population']) 

train_df.head()
longitude latitude housing_median_age households ... ocean_proximity rooms_per_household bedrooms_per_household population_per_household
6051 -117.75 34.04 22.0 602.0 ... INLAND 4.897010 1.056478 4.318937
20113 -119.57 37.94 17.0 20.0 ... INLAND 17.300000 6.500000 2.550000
14289 -117.13 32.74 46.0 708.0 ... NEAR OCEAN 4.738701 1.084746 2.057910
13665 -117.31 34.02 18.0 285.0 ... INLAND 5.733333 0.961404 3.154386
14471 -117.23 32.88 18.0 1458.0 ... NEAR OCEAN 3.817558 1.004801 4.323045

5 rows × 10 columns

Exploratory Data Analysis (EDA)

train_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 18576 entries, 6051 to 19966
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   longitude                 18576 non-null  float64
 1   latitude                  18576 non-null  float64
 2   housing_median_age        18576 non-null  float64
 3   households                18576 non-null  float64
 4   median_income             18576 non-null  float64
 5   median_house_value        18576 non-null  float64
 6   ocean_proximity           18576 non-null  object 
 7   rooms_per_household       18576 non-null  float64
 8   bedrooms_per_household    18391 non-null  float64
 9   population_per_household  18576 non-null  float64
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

train_df.describe()
longitude latitude housing_median_age households ... median_house_value rooms_per_household bedrooms_per_household population_per_household
count 18576.000000 18576.000000 18576.000000 18576.000000 ... 18576.000000 18576.000000 18391.000000 18576.000000
mean -119.565888 35.627966 28.622255 500.061100 ... 206292.067991 5.426067 1.097516 3.052349
std 1.999622 2.134658 12.588307 383.044313 ... 115083.856175 2.512319 0.486266 10.020873
... ... ... ... ... ... ... ... ... ...
50% -118.490000 34.250000 29.000000 410.000000 ... 179300.000000 5.226415 1.048860 2.818868
75% -118.010000 37.710000 37.000000 606.000000 ... 263600.000000 6.051620 1.099723 3.283921
max -114.310000 41.950000 52.000000 6082.000000 ... 500001.000000 141.909091 34.066667 1243.333333

8 rows × 9 columns


train_df["bedrooms_per_household"].isnull().sum()
np.int64(185)

What happens if we try to fit a model on data with missing values?

X_train = train_df.drop(columns=["median_house_value", "ocean_proximity"])
y_train = train_df["median_house_value"]

X_test = test_df.drop(columns=["median_house_value", "ocean_proximity"])
y_test = test_df["median_house_value"]


from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor()
knn.fit(X_train, y_train)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Dropping

train_df["bedrooms_per_household"].isnull().sum()
np.int64(185)


X_train.shape
(18576, 8)


X_train_no_nan = X_train.dropna()
y_train_no_nan = y_train[X_train_no_nan.index]  # keep X and y aligned

X_train_no_nan.shape
(18391, 8)
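One caveat when dropping rows: the feature matrix and the target must stay aligned, so the targets should be selected by the surviving row index rather than dropped independently. A minimal sketch with a toy DataFrame (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
y = pd.Series([10.0, 20.0, 30.0])

# Drop rows of X with missing values, then select the matching targets by index
X_no_nan = X.dropna()
y_no_nan = y[X_no_nan.index]

print(X_no_nan.shape)   # (2, 2)
print(list(y_no_nan))   # [10.0, 30.0]
```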

Dropping a column

X_train.shape
(18576, 8)


X_train_no_col = X_train.dropna(axis=1)

X_train_no_col.shape
(18576, 7)

Imputation

Imputation: filling in missing values with plausible estimates, so that we do not have to drop rows or columns.

from sklearn.impute import SimpleImputer

We can impute missing values in:

  • Categorical columns: with the most frequent value.
  • Numeric columns: with the mean or median of the column or a constant of our choosing.
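As a sketch of both strategies on a toy DataFrame (column names and values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "rooms": [4.0, np.nan, 6.0, 8.0],                        # numeric column
    "proximity": ["INLAND", "INLAND", np.nan, "NEAR OCEAN"],  # categorical column
})

# Numeric: fill with the median of the observed values
num_imp = SimpleImputer(strategy="median")
rooms_filled = num_imp.fit_transform(df[["rooms"]])

# Categorical: fill with the most frequent observed value
cat_imp = SimpleImputer(strategy="most_frequent")
prox_filled = cat_imp.fit_transform(df[["proximity"]])

print(rooms_filled.ravel())  # [4. 6. 6. 8.]
print(prox_filled.ravel())   # ['INLAND' 'INLAND' 'INLAND' 'NEAR OCEAN']
```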
X_train.sort_values('bedrooms_per_household').tail(10)
longitude latitude housing_median_age households median_income rooms_per_household bedrooms_per_household population_per_household
18786 -122.42 40.44 16.0 181.0 2.1875 5.491713 NaN 2.734807
17923 -121.97 37.35 30.0 386.0 4.6328 5.064767 NaN 2.588083
16880 -122.39 37.59 32.0 715.0 6.1323 6.289510 NaN 2.581818
... ... ... ... ... ... ... ... ...
6962 -118.05 33.99 38.0 357.0 3.7328 4.535014 NaN 2.481793
14970 -117.01 32.74 31.0 677.0 2.6973 5.129985 NaN 3.098966
7763 -118.10 33.91 36.0 130.0 3.6389 5.584615 NaN 3.769231

10 rows × 8 columns

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train);
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)

X_train_imp
array([[-117.75      ,   34.04      ,   22.        , ...,    4.89700997,    1.05647841,    4.31893688],
       [-119.57      ,   37.94      ,   17.        , ...,   17.3       ,    6.5       ,    2.55      ],
       [-117.13      ,   32.74      ,   46.        , ...,    4.73870056,    1.08474576,    2.0579096 ],
       ...,
       [-121.76      ,   37.33      ,    5.        , ...,    5.95839311,    1.03156385,    3.49354376],
       [-122.44      ,   37.78      ,   44.        , ...,    4.7392638 ,    1.02453988,    1.7208589 ],
       [-119.08      ,   36.21      ,   20.        , ...,    5.49137931,    1.11781609,    3.56609195]], shape=(18576, 8))

X_train_imp_df = pd.DataFrame(X_train_imp, columns=X_train.columns, index=X_train.index)
X_train_imp_df.loc[[7763]]
longitude latitude housing_median_age households median_income rooms_per_household bedrooms_per_household population_per_household
7763 -118.1 33.91 36.0 130.0 3.6389 5.584615 1.04886 3.769231


X_train.loc[[7763]]
longitude latitude housing_median_age households median_income rooms_per_household bedrooms_per_household population_per_household
7763 -118.1 33.91 36.0 130.0 3.6389 5.584615 NaN 3.769231
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor()
knn.fit(X_train_imp, y_train);
knn.score(X_train_imp, y_train)
0.5609808539232339
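The impute-then-fit steps above can also be chained with a scikit-learn `Pipeline`, so the imputer is fit only on the training data inside each fit call. A minimal sketch on synthetic data (the data-generating code is illustrative, not the housing dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(123)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)
X[rng.random(X.shape) < 0.1] = np.nan  # sprinkle in ~10% missing values

pipe = make_pipeline(SimpleImputer(strategy="median"),
                     KNeighborsRegressor())
pipe.fit(X, y)
print(pipe.score(X, y))  # R^2 on the training data
```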

Let’s apply what we learned!