simplify._ambiguous_columns_split

simplify._ambiguous_columns_split(
    pd_dataframe,
    target_column,
    ambiguous_column_types=None,
)

Separates the numeric and categorical columns of a pandas DataFrame and applies user-supplied overrides for ambiguous cases. Private helper used only by the all_distributions function.

This function automatically classifies DataFrame columns as numeric or categorical based on their data types. It supports manual overrides when the automatic classification is incorrect (e.g., a numeric zip code that should be treated as categorical).
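The dtype-based classification described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name split_by_dtype is hypothetical, and it assumes that pandas' select_dtypes with "number" and "bool" is an acceptable stand-in for the numeric test.

```python
import pandas as pd


def split_by_dtype(df: pd.DataFrame) -> dict:
    """Hypothetical helper: split columns into numeric and categorical by dtype."""
    # "number" covers int, float and complex dtypes; "bool" is listed explicitly
    # because this sketch treats boolean columns as numeric.
    numeric = df.select_dtypes(include=["number", "bool"])
    # Everything else (object, string, datetime, category) is treated as categorical.
    categorical = df.select_dtypes(exclude=["number", "bool"])
    return {"numeric": numeric, "categorical": categorical}


df = pd.DataFrame({
    "age": [25, 30],
    "zip_code": [10001, 90210],  # stored as int, so classified as numeric
    "city": ["NYC", "LA"],
})
parts = split_by_dtype(df)
print(list(parts["numeric"].columns))      # ['age', 'zip_code']
print(list(parts["categorical"].columns))  # ['city']
```

Here zip_code lands in the numeric group purely because of its integer dtype, which is the kind of case the overrides are meant to correct.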

Parameters

Name Type Description Default
pd_dataframe pandas.DataFrame Input DataFrame to separate into numeric and categorical columns. required
target_column str The name of the target column. Regardless of dtype, the target column is included in both the numeric and categorical outputs. required
ambiguous_column_types dict Dictionary specifying column type overrides for ambiguous cases. Expected keys are “numeric” and “categorical”, each containing a list of column names to force into that category. Invalid or non-existent column names are silently ignored. Numeric is defined as: int, float, and complex dtypes, including 32- and 64-bit int/float, np.number, and boolean columns (pandas behaviour). Categorical is defined as: non-numeric columns, including object, string, datetime, and categorical dtypes. Example: ambiguous_column_types = {"numeric": ["year"], "categorical": ["zip_code"]} None
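As a rough sketch of how such overrides might be applied to an initial dtype-based split: the helper apply_overrides and its arguments are hypothetical and only mirror the behaviour described above (forced columns move to the requested group, unknown names are ignored).

```python
import pandas as pd


def apply_overrides(df, numeric_cols, categorical_cols, overrides=None):
    """Hypothetical helper: move columns between groups according to overrides."""
    overrides = overrides or {}
    # Keep only override names that exist in the DataFrame; others are ignored.
    force_num = [c for c in overrides.get("numeric", []) if c in df.columns]
    force_cat = [c for c in overrides.get("categorical", []) if c in df.columns]
    # Drop columns forced into the other group, then append the forced ones.
    numeric = [c for c in numeric_cols if c not in force_cat]
    numeric += [c for c in force_num if c not in numeric]
    categorical = [c for c in categorical_cols if c not in force_num]
    categorical += [c for c in force_cat if c not in categorical]
    return numeric, categorical


df = pd.DataFrame({
    "year": [2020, 2021],
    "zip_code": [10001, 90210],
    "city": ["NYC", "LA"],
})
numeric, categorical = apply_overrides(
    df,
    numeric_cols=["year", "zip_code"],
    categorical_cols=["city"],
    overrides={"categorical": ["zip_code", "not_a_column"]},
)
print(numeric)      # ['year']
print(categorical)  # ['city', 'zip_code']
```

The name not_a_column is dropped without an error, matching the "silently ignored" behaviour stated in the parameter description.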

Returns

Name Type Description
dict A dictionary with keys “numeric” and “categorical”, each containing a filtered DataFrame with only the columns of that type.

Raises

Name Type Description
ValueError If the input DataFrame is empty.
ValueError If a column is specified in both “numeric” and “categorical” lists in ambiguous_column_types.
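The two error conditions could be checked along these lines. This is a sketch under assumptions: validate_inputs is a hypothetical name, and the error messages are illustrative rather than the ones the library raises.

```python
import pandas as pd


def validate_inputs(df, overrides=None):
    """Hypothetical helper: raise ValueError for the two documented cases."""
    if df.empty:
        raise ValueError("Input DataFrame is empty.")
    overrides = overrides or {}
    # A column may be forced into only one group.
    clash = set(overrides.get("numeric", [])) & set(overrides.get("categorical", []))
    if clash:
        raise ValueError(
            f"Columns listed as both numeric and categorical: {sorted(clash)}"
        )


df = pd.DataFrame({"zip_code": [10001, 90210]})
try:
    validate_inputs(df, {"numeric": ["zip_code"], "categorical": ["zip_code"]})
except ValueError as err:
    message = str(err)
    print(message)
```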

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'age': [25, 30, 35, 40],
...     'income': [50000, 60000, 75000, 80000],
...     'city': ['NYC', 'LA', 'Chicago', 'Boston'],
...     'education': ['BS', 'MS', 'PhD', 'BS'],
...     'approved': [True, False, True, True]
... })
>>> result = _ambiguous_columns_split(
...     pd_dataframe=df,
...     target_column='approved'
... )
>>> result['numeric'].columns
Index(['age', 'income', 'approved'], dtype='object')
>>> result['categorical'].columns
Index(['city', 'education', 'approved'], dtype='object')