missing_values
Functions
| Name | Description |
|---|---|
| missing_values | This function fills missing values (NaN) in a pandas DataFrame using column-appropriate imputation strategies. |
missing_values
missing_values.missing_values(df, method='median')
This function fills missing values (NaN) in a pandas DataFrame using column-appropriate imputation strategies.
This function imputes missing values in both numeric and categorical columns. Numeric columns are filled using a user-specified method (mean, median, or mode), while categorical (non-numeric) columns are automatically filled using mode imputation.
Missing values can distort statistical analyses and machine learning models. This function provides common strategies for imputing missing values depending on the nature of the data distribution.
The function identifies numeric and non-numeric columns and applies imputation independently to each column.
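The column-wise behavior described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual source: the name `impute_sketch` is hypothetical, `method` is assumed to have been validated already, and detecting numeric columns with `select_dtypes(include="number")` is an assumption about how the split is made.

```python
import numpy as np
import pandas as pd


def impute_sketch(df: pd.DataFrame, method: str = "median") -> pd.DataFrame:
    """Hypothetical re-implementation of the documented column-wise logic."""
    result = df.copy()  # work on a copy so the caller's DataFrame is untouched
    # Assumption: "numeric" means any column pandas reports as a number dtype.
    numeric_cols = result.select_dtypes(include="number").columns
    for col in result.columns:
        series = result[col]
        if series.isna().all():
            continue  # all-NaN column: no value to estimate a fill from
        if col in numeric_cols and method == "mean":
            fill = series.mean()
        elif col in numeric_cols and method == "median":
            fill = series.median()
        else:
            # Numeric 'mode' and every non-numeric column: first mode from pandas.
            fill = series.mode().iloc[0]
        result[col] = series.fillna(fill)
    return result


df = pd.DataFrame({
    "age": [25, 30, np.nan, 28],
    "city": ["A", "B", np.nan, "B"],
})
filled = impute_sketch(df, method="median")
```

Each column is handled independently, so a fill value computed for one column never influences another.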
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | The DataFrame containing missing values to be imputed. | required |
| method | str | The imputation method to use for numeric columns. Valid options are `'mean'` (replace NaN with the column mean; suitable for symmetric data), `'median'` (replace NaN with the column median; robust to outliers), and `'mode'` (replace NaN with the column mode). Categorical (non-numeric) columns always use mode imputation regardless of the selected method. | "median" |
Returns
| Name | Type | Description |
|---|---|---|
| result_df | pd.DataFrame | A DataFrame with missing values filled in both numeric and categorical (non-numeric) columns. |
| filled_percentage | float | The percentage of total DataFrame values that were originally missing and have been filled, calculated as (number of filled values / number of total values) * 100. Columns containing only NaN values are left unchanged and do not contribute any filled values to this percentage. |
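As a worked example of the percentage formula, the snippet below counts the fillable cells by hand. It is only a sketch: the variable names are illustrative, and it assumes that the "number of total values" is every cell in the DataFrame, including cells in all-NaN columns, which the description does not state explicitly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, np.nan, 28],
    "income": [50000, np.nan, 52000, np.nan],
    "empty": [np.nan] * 4,  # all-NaN column: left unchanged, so nothing is filled here
})

# Only columns with at least one observed value can be imputed.
fillable = df.loc[:, ~df.isna().all()]
filled_count = int(fillable.isna().sum().sum())  # 1 (age) + 2 (income) = 3

# Assumption: the denominator is the total number of cells in the DataFrame.
total_values = df.size  # 4 rows * 3 columns = 12
filled_percentage = filled_count / total_values * 100  # 3 / 12 * 100 = 25.0
```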
Raises
| Name | Type | Description |
|---|---|---|
| | TypeError | If df is not a pandas DataFrame. |
| | ValueError | If method is not one of the three supported options ('mean', 'median', 'mode'). |
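The two error conditions could be checked along these lines. The helper name `validate_inputs`, the constant `VALID_METHODS`, and the error messages are hypothetical; only the exception types and the conditions that trigger them come from the table above.

```python
import pandas as pd

VALID_METHODS = ("mean", "median", "mode")


def validate_inputs(df, method="median"):
    """Hypothetical helper mirroring the documented error conditions."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"df must be a pandas DataFrame, got {type(df).__name__}")
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}, got {method!r}")


# Callers can catch the two cases separately.
try:
    validate_inputs(pd.DataFrame({"a": [1.0]}), method="max")
except ValueError as err:
    message = str(err)
```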
Notes
- Numeric columns are imputed using the specified method.
- Categorical (non-numeric) columns are imputed using mode.
- Imputation is applied column-wise.
- Columns containing all NaN values are left unchanged and do not affect the filled percentage.
- If multiple modes exist (for both numeric and categorical columns), the first mode returned by pandas is used.
- The original DataFrame is not modified; a copy is returned.
- The filled percentage includes values filled in both numeric and categorical (non-numeric) columns.
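The notes on tied modes and all-NaN columns follow from how `pandas.Series.mode` behaves, which can be checked directly. This snippet uses plain pandas only and makes no claim about the function's internals.

```python
import numpy as np
import pandas as pd

# Two values tie for most frequent; NaN is ignored by default.
ties = pd.Series(["b", "a", "b", "a", np.nan])
modes = ties.mode()         # all tied modes, returned in sorted order
first_mode = modes.iloc[0]  # "a", the first mode pandas returns

# An all-NaN column has no mode at all, so there is no value to fill with.
empty = pd.Series([np.nan, np.nan])
no_mode = empty.mode()      # an empty Series
```

Because pandas returns tied modes in sorted order, "the first mode" is the smallest tied value, not the one that appears first in the column.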
Examples
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'age': [25, 30, np.nan, 28],
...     'income': [50000, np.nan, 52000, np.nan],
...     'city': ['A', 'B', np.nan, 'B']
... })
>>> result_df, filled_percentage = missing_values(df, method='median')
>>> print(result_df)
    age   income  city
0  25.0  50000.0     A
1  30.0  51000.0     B
2  28.0  52000.0     B
3  28.0  51000.0     B
>>> print(f"{filled_percentage:.1f}% of values were filled.")
33.3% of values were filled.