simplify.all_distributions
simplify.all_distributions(
pd_dataframe,
target_column,
categorical_target,
max_categories=10,
categorical_features=None,
ambiguous_column_types=None,
)
Generate distribution visualizations (e.g., histograms and bar charts) for numeric and categorical columns in a DataFrame.
This is the main function for column-level EDA distribution visualizations. It automatically infers whether each column is numeric or categorical and allows manual overrides for ambiguous columns, that is, columns with a numeric datatype that should be treated as categorical, or vice versa.
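To make the infer-then-override behavior concrete, here is a minimal sketch of how column types could be resolved. It is not the library's implementation: `infer_column_types` is a hypothetical helper, and the dtype-based rule (numeric dtype means numeric, everything else categorical) is an assumption made for illustration.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def infer_column_types(df, overrides=None):
    """Hypothetical helper: split columns into numeric and categorical."""
    overrides = overrides or {}
    forced_numeric = set(overrides.get("numeric", []))
    forced_categorical = set(overrides.get("categorical", []))

    numeric, categorical = [], []
    for col in df.columns:
        if col in forced_numeric:
            numeric.append(col)
        elif col in forced_categorical:
            categorical.append(col)
        # Assumed rule: numeric dtypes (excluding booleans) count as numeric.
        elif is_numeric_dtype(df[col]) and not is_bool_dtype(df[col]):
            numeric.append(col)
        else:
            categorical.append(col)
    return {"numeric": numeric, "categorical": categorical}


example = pd.DataFrame({
    "income": [50000, 60000, 75000],
    "zip_code": [10001, 90001, 60601],
    "city": ["NYC", "LA", "Chicago"],
})
# zip_code is stored as integers but is really a label, so it is overridden.
resolved = infer_column_types(example, {"categorical": ["zip_code"]})
```

Without the override, `zip_code` would land in the numeric group under this rule, which is the kind of ambiguous case the override parameter is meant to handle.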
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| pd_dataframe | pandas.DataFrame | Input DataFrame containing the data to be analyzed. Expects a tidy DataFrame (one value or string per cell) but can handle some common messy-data issues, such as incorrect datatypes, via the ambiguous_column_types parameter. | required |
| target_column | str | The name of the target column. Funneled to all subfunctions. | required |
| categorical_target | bool | A boolean value indicating if the target column is categorical or not. | required |
| max_categories | int | The maximum number of categories to plot for high-cardinality features. Funneled to the categorical_plot function. | 10 |
| categorical_features | list | Subset of columns to use for categorical plots. If this is not passed, all columns are kept. Invalid or non-existent column names are ignored. | None |
| ambiguous_column_types | dict | Dictionary specifying column type overrides for ambiguous cases. Expected keys are "numeric" and "categorical", with values being lists of column names to force into each category. A ValueError is raised if a column appears in both lists. Invalid or non-existent column names are ignored. Example: ambiguous_column_types = {"numeric": ["year"], "categorical": ["zip_code"]} | None |
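The rules stated for ambiguous_column_types (a ValueError when a column appears in both lists, unknown names ignored) can be sketched as follows. `validate_overrides` is a hypothetical helper written for illustration under those stated rules, not a function exposed by the library.

```python
import pandas as pd


def validate_overrides(df, ambiguous_column_types=None):
    """Hypothetical helper mirroring the documented override rules."""
    if ambiguous_column_types is None:
        return {"numeric": [], "categorical": []}

    numeric = list(ambiguous_column_types.get("numeric", []))
    categorical = list(ambiguous_column_types.get("categorical", []))

    # A column cannot be forced into both types at once.
    conflicts = sorted(set(numeric) & set(categorical))
    if conflicts:
        raise ValueError(
            f"Columns listed as both numeric and categorical: {conflicts}"
        )

    # Names that are not present in the DataFrame are silently dropped.
    existing = set(df.columns)
    return {
        "numeric": [c for c in numeric if c in existing],
        "categorical": [c for c in categorical if c in existing],
    }


frame = pd.DataFrame({"year": [2020, 2021], "zip_code": [10001, 90001]})
cleaned = validate_overrides(
    frame, {"numeric": ["year", "missing_col"], "categorical": ["zip_code"]}
)
```

Here `missing_col` is dropped because it is not a column of `frame`, while listing `year` under both keys would raise a ValueError.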
Returns
| Name | Type | Description |
|---|---|---|
| | dict | This function produces distribution plots as a side effect and returns a dictionary of plots by column type: {"numeric": numeric_plots, "categorical": categorical_plots}. Currently, categorical_plots holds its plots as a list of appended plot objects, and numeric_plots holds its plots in a dictionary organized by plot type. |
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'age': [25, 30, 35, 40, 45],
... 'income': [50000, 60000, 75000, 80000, 90000],
... 'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA'],
... 'approved': ['Yes', 'No', 'Yes', 'Yes', 'No']
... })
>>> plots = all_distributions(
... pd_dataframe=df,
... target_column='approved',
... categorical_target=True
... )
>>> plots.keys()
dict_keys(['numeric', 'categorical'])
>>> numeric_plots = plots['numeric']
>>> categorical_plots = plots['categorical']