generate_report

A module that generates a summary report given an input dataframe. Includes useful summary statistics for numeric and categorical data for use in data analysis.

Requires: pandas >= 1.0.0, scipy >= 1.0.0

LLM Usage Disclosure

Claude.ai was used to perform the following tasks:

Provide recommendations for which statistics to include in the output report, given their frequency of use in real-world data analysis.
Generate pseudocode for the confidence interval and proportion calculations.
Look for edge cases in the code and recommend how best to address them, particularly all-null columns and input DataFrames with extremely small row counts.

Functions

Name	Description
summary_report	For an input DataFrame, generate a summary report of statistics for its numeric and categorical columns.

summary_report

generate_report.summary_report(df, confidence_level=0.95, top_n=5)

For an input DataFrame, generate a summary report including:

For numeric columns (int, float):

‘count’: number of non-null values
‘n_missing’: number of missing values
‘missing_prop’: proportion of missing values
‘mean’: arithmetic mean
‘ci_lower’: lower bound of confidence interval for mean
‘ci_upper’: upper bound of confidence interval for mean
‘median’: median
‘std’: standard deviation
‘min’: minimum value
‘25%’: first quartile
‘75%’: third quartile
‘max’: maximum value
‘n_unique’: number of unique values
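
As a rough illustration, the descriptive statistics above (everything except the confidence interval) can be derived from a single pandas Series as shown below. This is a minimal sketch, not the module's actual implementation: `numeric_column_stats` is a hypothetical helper name, and the use of the sample standard deviation (ddof=1) and pandas' default linear interpolation for quartiles are assumptions.

```python
import pandas as pd


def numeric_column_stats(s: pd.Series) -> dict:
    """Hypothetical helper: descriptive statistics for one numeric column.

    Assumes the Series has at least one row, since summary_report
    rejects empty DataFrames.
    """
    n_total = len(s)
    n_missing = int(s.isna().sum())
    return {
        "count": int(s.count()),          # non-null values only
        "n_missing": n_missing,
        "missing_prop": n_missing / n_total,
        "mean": s.mean(),
        "median": s.median(),
        "std": s.std(),                   # pandas default: sample std (ddof=1)
        "min": s.min(),
        "25%": s.quantile(0.25),          # default linear interpolation
        "75%": s.quantile(0.75),
        "max": s.max(),
        "n_unique": int(s.nunique()),     # missing values are not counted
    }


stats = numeric_column_stats(pd.Series([2.0, 4.0, 4.0, None, 10.0]))
print(stats)
```

pandas skips missing values in these aggregations by default, so no explicit dropna() call is needed here.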

For categorical columns (object, string, category, bool, datetime):

‘count’: number of non-null values
‘n_missing’: number of missing values
‘missing_prop’: proportion of missing values
‘n_unique’: number of unique values
‘unique_prop’: proportion of unique values to total count
‘is_constant’: boolean indicating if only one unique value exists
‘top_values’: dictionary for up to top_n most frequent values
‘top_1_prop’: proportion of most common value
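
The categorical statistics above can be sketched in a similar way. The helper name `categorical_column_stats` is hypothetical, and the choice of the non-null count as the denominator for unique_prop and top_1_prop is an assumption; the documentation does not say whether missing values are included in the total.

```python
import pandas as pd


def categorical_column_stats(s: pd.Series, top_n: int = 5) -> dict:
    """Hypothetical helper: summary statistics for one categorical column.

    Assumes the column has at least one non-null value, since all-null
    columns are excluded from the report.
    """
    n_total = len(s)
    count = int(s.count())
    # value_counts() ignores missing values and sorts by descending frequency.
    counts = s.value_counts()
    # For 'category' dtype, unused categories show up with a count of 0.
    counts = counts[counts > 0]
    n_unique = int(counts.size)
    top = counts.head(top_n)
    return {
        "count": count,
        "n_missing": n_total - count,
        "missing_prop": (n_total - count) / n_total,
        "n_unique": n_unique,
        "unique_prop": n_unique / count,       # assumption: non-null denominator
        "is_constant": n_unique == 1,
        "top_values": {key: int(value) for key, value in top.items()},
        "top_1_prop": int(top.iloc[0]) / count,  # assumption: non-null denominator
    }


stats = categorical_column_stats(pd.Series(["NYC", "LA", "NYC", "SF", "LA"]))
print(stats)
```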

Note: Confidence intervals for the mean are calculated using the t-distribution and assume approximately normal data or a sufficiently large sample (n >= 30). Columns in which every value is null are excluded from the output. Numeric columns with fewer than 2 non-null values have ci_lower and ci_upper set to None.
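
For reference, a t-based confidence interval for the mean can be computed roughly as follows. This is a sketch under stated assumptions rather than the module's code: `mean_confidence_interval` is a hypothetical name, and the standard error is assumed to use the sample standard deviation (ddof=1) with n - 1 degrees of freedom.

```python
import math

import pandas as pd
from scipy import stats


def mean_confidence_interval(s: pd.Series, confidence_level: float = 0.95):
    """Hypothetical helper: two-sided t-interval for the mean of a column."""
    values = s.dropna()
    n = len(values)
    if n < 2:
        # The sample standard deviation is undefined for fewer than 2 values.
        return None, None
    mean = values.mean()
    sem = values.std(ddof=1) / math.sqrt(n)  # standard error of the mean
    # Two-sided critical value with n - 1 degrees of freedom.
    t_crit = stats.t.ppf((1 + confidence_level) / 2, df=n - 1)
    margin = t_crit * sem
    return mean - margin, mean + margin


lower, upper = mean_confidence_interval(pd.Series([10, 12, 14, 16, 18]))
print(lower, upper)
```

With few observations the t critical value is noticeably larger than the normal value of about 1.96, so intervals for small samples are correspondingly wide.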

Parameters

Name	Type	Description	Default
df	pd.DataFrame	DataFrame to obtain summary statistics for.	required
confidence_level	float	Confidence level for calculating confidence intervals for numeric columns. Must be between 0 and 1.	`0.95`
top_n	int	Maximum number of most frequent values to include in top_values for categorical columns.	`5`

Returns

Name	Type	Description
	tuple[pd.DataFrame, pd.DataFrame]	A tuple of (numeric_stats, categorical_stats). numeric_stats: DataFrame with one row per numeric column of the input, indexed by column name, and one column per statistic. categorical_stats: DataFrame with one row per categorical column of the input, indexed by column name, and one column per statistic. If the input has no numeric columns or no categorical columns, the corresponding DataFrame is empty.
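
The layout of the returned frames (one row per input column, one column per statistic) can be illustrated with the sketch below. It is only an approximation of the described behavior: treating every non-numeric dtype as categorical and using select_dtypes to find numeric columns are assumptions, and only two statistics are computed to keep the example short.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 21, 32, None, 40],
    "city": ["NYC", "LA", "NYC", "SF", "LA"],
    "empty": [None] * 5,
})

# Drop columns in which every value is null.
usable = df.dropna(axis=1, how="all")

# 'number' matches int and float dtypes but not bool or datetime.
numeric_cols = list(usable.select_dtypes(include="number").columns)
categorical_cols = [col for col in usable.columns if col not in numeric_cols]

# orient="index" turns the outer keys into the row index and the
# inner keys into columns, which matches the documented layout.
numeric_stats = pd.DataFrame.from_dict(
    {col: {"count": int(usable[col].count()), "mean": usable[col].mean()}
     for col in numeric_cols},
    orient="index",
)
categorical_stats = pd.DataFrame.from_dict(
    {col: {"count": int(usable[col].count()), "n_unique": int(usable[col].nunique())}
     for col in categorical_cols},
    orient="index",
)

print(numeric_stats)
print(categorical_stats)
```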

Raises

Name	Type	Description
	TypeError	If df is not a pandas.DataFrame.
	ValueError	If df is empty (has no rows), if confidence_level is not between 0 and 1, or if top_n < 1.
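
The checks described above could look roughly like this. `validate_inputs` is a hypothetical name, the error messages are illustrative, and treating the bounds on confidence_level as exclusive (0 < confidence_level < 1) is an assumption.

```python
import pandas as pd


def validate_inputs(df, confidence_level=0.95, top_n=5):
    """Hypothetical helper mirroring the documented error conditions."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas.DataFrame")
    if len(df) == 0:
        raise ValueError("df must contain at least one row")
    # Assumption: both bounds are exclusive.
    if not 0 < confidence_level < 1:
        raise ValueError("confidence_level must be between 0 and 1")
    if top_n < 1:
        raise ValueError("top_n must be at least 1")


# A valid call passes silently.
validate_inputs(pd.DataFrame({"a": [1, 2, 3]}))

# An invalid call raises the documented exception type.
try:
    validate_inputs([1, 2, 3])
except TypeError as exc:
    print(f"TypeError: {exc}")
```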

Examples

>>> import pandas as pd
>>> from csvplus.generate_report import summary_report
>>>
>>> df = pd.DataFrame({
...     'age': [25, 21, 32, None, 40],
...     'city': ['NYC', 'LA', 'NYC', 'SF', 'LA']
... })
>>> numeric_stats, categorical_stats = summary_report(df)
>>> numeric_stats.loc['age', 'mean']
29.5
>>> numeric_stats.loc['age', 'n_missing']
1
>>> numeric_stats.loc['age', 'missing_prop']
0.2
>>> numeric_stats.loc['age', 'ci_lower']
22.3
>>> categorical_stats.loc['city', 'n_unique']
3
>>> categorical_stats.loc['city', 'top_values']
{'NYC': 2, 'LA': 2, 'SF': 1}
>>> categorical_stats.loc['city', 'unique_prop']
0.6