simplify.all_distributions

simplify.all_distributions(
    pd_dataframe,
    target_column,
    categorical_target,
    max_categories=10,
    categorical_features=None,
    ambiguous_column_types=None,
)

Generate distribution visualizations (e.g., histograms and bar charts) for numeric and categorical columns in a DataFrame.

This is the main function for column-level EDA distribution visualizations. It automatically infers whether each column is numeric or categorical, and it allows manual overrides for ambiguous columns, that is, columns with a numeric datatype that should be treated as categorical, or vice versa.
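
The inference-plus-override behavior described above can be illustrated with a minimal sketch. This is not the library's implementation: the helper name `infer_column_types` is hypothetical, and plain Python lists stand in for DataFrame columns, so the numeric check here is only an assumed approximation of a dtype check.

```python
from numbers import Number


def infer_column_types(columns, overrides=None):
    """Split column names into numeric and categorical groups.

    Hypothetical helper: `columns` maps column name -> list of values
    (a plain-Python stand-in for a DataFrame), and `overrides` follows
    the shape of the ambiguous_column_types parameter.
    """
    overrides = overrides or {}
    forced_numeric = set(overrides.get("numeric", []))
    forced_categorical = set(overrides.get("categorical", []))

    result = {"numeric": [], "categorical": []}
    for name, values in columns.items():
        # Default guess: numeric if every value is a number (bools excluded).
        looks_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        if name in forced_categorical:
            kind = "categorical"
        elif name in forced_numeric:
            kind = "numeric"
        else:
            kind = "numeric" if looks_numeric else "categorical"
        result[kind].append(name)
    return result


columns = {
    "age": [25, 30, 35],
    "zip_code": [10001, 90210, 60601],  # numeric values, but really a label
    "city": ["NYC", "LA", "Chicago"],
}
types = infer_column_types(columns, overrides={"categorical": ["zip_code"]})
print(types)
```

Without the override, `zip_code` would be grouped with the numeric columns, which is the kind of ambiguous case the `ambiguous_column_types` parameter is meant to handle.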

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| pd_dataframe | pandas.DataFrame | Input DataFrame containing the data to be analyzed. Expects a tidy DataFrame (one value or string per cell) but can handle some common messy-data issues, such as incorrect datatypes, via the ambiguous_column_types parameter. | required |
| target_column | str | The name of the target column. Passed through to all subfunctions. | required |
| categorical_target | bool | Whether the target column is categorical. | required |
| max_categories | int | The maximum number of categories to plot for high-cardinality features. Passed through to the categorical_plot function. | 10 |
| categorical_features | list | Subset of columns to use for categorical plots. If not passed, all columns are kept. Invalid or non-existent column names are ignored. | None |
| ambiguous_column_types | dict | Dictionary specifying column type overrides for ambiguous cases. Expected keys are "numeric" and "categorical", with values being lists of column names to force into each category. If a column appears in both lists, a ValueError is raised. Invalid or non-existent column names are ignored. Example: ambiguous_column_types = {"numeric": ["year"], "categorical": ["zip_code"]} | None |
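
The rules stated for ambiguous_column_types (a ValueError when a column appears in both lists, unknown column names silently ignored) can be sketched as follows. The function name `validate_overrides` and the error message are assumptions made for illustration; the library's internal checks may be organized differently.

```python
def validate_overrides(ambiguous_column_types, available_columns):
    """Return cleaned overrides, or raise ValueError on a conflict.

    Hypothetical helper illustrating the documented rules only.
    """
    overrides = ambiguous_column_types or {}
    numeric = list(overrides.get("numeric", []))
    categorical = list(overrides.get("categorical", []))

    # A column cannot be forced into both groups at once.
    conflicts = sorted(set(numeric) & set(categorical))
    if conflicts:
        raise ValueError(
            f"Columns listed as both numeric and categorical: {conflicts}"
        )

    # Names that are not in the DataFrame are dropped without an error.
    known = set(available_columns)
    return {
        "numeric": [c for c in numeric if c in known],
        "categorical": [c for c in categorical if c in known],
    }


cleaned = validate_overrides(
    {"numeric": ["year", "missing_col"], "categorical": ["zip_code"]},
    available_columns=["year", "zip_code", "price"],
)
print(cleaned)
```

Here `missing_col` is dropped because it is not among the available columns, while passing the same name under both keys would raise a ValueError.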

Returns

| Name | Type | Description |
|---|---|---|
| | dict | This function produces distribution plots as a side effect and returns a dictionary keyed by plot type: {"numeric": numeric_plots, "categorical": categorical_plots}. Currently, categorical_plots contains plots in the form of an appended plot object / list, and numeric_plots contains plots organized in a dictionary according to plot type. |

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'age': [25, 30, 35, 40, 45],
...     'income': [50000, 60000, 75000, 80000, 90000],
...     'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA'],
...     'approved': ['Yes', 'No', 'Yes', 'Yes', 'No']
... })
>>> plots = all_distributions(
...     pd_dataframe=df,
...     target_column='approved',
...     categorical_target=True
... )
>>> plots.keys()
dict_keys(['numeric', 'categorical'])
>>> numeric_plots = plots['numeric']
>>> categorical_plots = plots['categorical']