generate_report

A module that generates a summary report given an input dataframe. Includes useful summary statistics for numeric and categorical data for use in data analysis.

Requires: pandas >= 1.0.0, scipy >= 1.0.0

LLM Usage Disclosure

Claude.ai was used to perform the following tasks:

  • Provide recommendations for which statistics to include in the output report, given their frequency of use in real-world data analysis.
  • Generate pseudocode for the confidence interval and proportion calculations.
  • Look for edge cases in the code and recommend how best to address them, particularly all-null columns and input DataFrames with extremely small row counts.

Functions

Name | Description
summary_report | Generate a summary report of per-column statistics for an input DataFrame.

summary_report

generate_report.summary_report(df, confidence_level=0.95, top_n=5)

For an input DataFrame, generate a summary report including:

For numeric columns (int, float):

  • ‘count’: number of non-null values
  • ‘n_missing’: number of missing values
  • ‘missing_prop’: proportion of missing values
  • ‘mean’: arithmetic mean
  • ‘ci_lower’: lower bound of confidence interval for mean
  • ‘ci_upper’: upper bound of confidence interval for mean
  • ‘median’: median
  • ‘std’: standard deviation
  • ‘min’: minimum value
  • ‘25%’: first quartile
  • ‘75%’: third quartile
  • ‘max’: maximum value
  • ‘n_unique’: number of unique values

For categorical columns (object, string, category, bool, datetime):

  • ‘count’: number of non-null values
  • ‘n_missing’: number of missing values
  • ‘missing_prop’: proportion of missing values
  • ‘n_unique’: number of unique values
  • ‘unique_prop’: proportion of unique values to total count
  • ‘is_constant’: boolean indicating if only one unique value exists
  • ‘top_values’: dictionary for up to top_n most frequent values
  • ‘top_1_prop’: proportion of most common value
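
As a rough illustration of how the categorical statistics listed above could be derived with pandas, here is a minimal sketch. The helper name `categorical_summary` is hypothetical and is not part of this module, and the choice to divide `unique_prop` and `top_1_prop` by the non-null count is an assumption about what "total count" means.

```python
import pandas as pd


def categorical_summary(values: pd.Series, top_n: int = 5) -> dict:
    """Hypothetical helper: summarize one categorical column."""
    count = int(values.notna().sum())
    n_missing = int(values.isna().sum())
    n_unique = int(values.nunique(dropna=True))
    # value_counts() drops nulls by default and sorts by descending frequency.
    counts = values.value_counts()
    return {
        "count": count,
        "n_missing": n_missing,
        "missing_prop": n_missing / len(values) if len(values) else 0.0,
        "n_unique": n_unique,
        # Assumption: proportions are taken relative to the non-null count.
        "unique_prop": n_unique / count if count else 0.0,
        "is_constant": n_unique == 1,
        "top_values": counts.head(top_n).to_dict(),
        "top_1_prop": float(counts.iloc[0]) / count if count else 0.0,
    }


cities = pd.Series(["NYC", "LA", "NYC", "SF", "LA"])
summary = categorical_summary(cities)
print(summary["n_unique"], summary["unique_prop"], summary["top_1_prop"])
```

The real summary_report may handle ties, null-only columns, or non-hashable values differently; the sketch only shows the shape of the calculation.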

Note: Confidence intervals are calculated using the t-distribution and assume approximately normal data or a sufficiently large sample (n >= 30). Columns whose values are all null are excluded from the output. For numeric columns with fewer than 2 non-null values, ci_lower and ci_upper are set to None.
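
The t-based interval described in the note can be sketched as follows. This is an illustration under stated assumptions, not the module's actual implementation: the helper name `t_confidence_interval` is hypothetical, and the use of the sample standard deviation (ddof=1) is assumed.

```python
import math

import pandas as pd
from scipy import stats


def t_confidence_interval(values: pd.Series, confidence_level: float = 0.95):
    """Hypothetical helper: t-based confidence interval for the mean."""
    clean = values.dropna()
    n = len(clean)
    if n < 2:
        # Mirrors the documented behaviour: with fewer than 2 non-null
        # values no interval can be computed.
        return None, None
    mean = float(clean.mean())
    # Standard error of the mean; ddof=1 (sample std) is an assumption.
    sem = float(clean.std(ddof=1)) / math.sqrt(n)
    # Two-sided critical value with n - 1 degrees of freedom.
    t_crit = float(stats.t.ppf((1 + confidence_level) / 2, df=n - 1))
    margin = t_crit * sem
    return mean - margin, mean + margin


ages = pd.Series([25, 21, 32, None, 40])
lower, upper = t_confidence_interval(ages)
print(lower, upper)
```

With only four non-null values the interval is wide, which is why the note recommends approximately normal data or a larger sample before relying on it.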

Parameters

Name | Type | Description | Default
df | pd.DataFrame | DataFrame to obtain summary statistics for. | required
confidence_level | float | Confidence level for calculating confidence intervals for numeric columns. Must be between 0 and 1. | 0.95
top_n | int | Maximum number of most frequent values to include in top_values for categorical columns. | 5

Returns

Type | Description
tuple[pd.DataFrame, pd.DataFrame] | A tuple of (numeric_stats, categorical_stats), where:

  • numeric_stats: DataFrame with one row per numeric column of the input DataFrame, indexed by column name, and one column per statistic.
  • categorical_stats: DataFrame with one row per categorical column of the input DataFrame, indexed by column name, and one column per statistic.

If the input has no numeric columns or no categorical columns, the respective DataFrame will be empty.

Raises

Type | Description
TypeError | If df is not a pandas.DataFrame.
ValueError | If df is empty (has no rows), if confidence_level is not between 0 and 1, or if top_n < 1.
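
The error conditions above could be checked along these lines. This is a hedged sketch only: `validate_inputs` is a hypothetical name, the error messages are invented for illustration, and treating the confidence_level bounds as exclusive is an assumption.

```python
import pandas as pd


def validate_inputs(df, confidence_level=0.95, top_n=5):
    """Hypothetical helper mirroring the documented error conditions."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas.DataFrame")
    if len(df) == 0:
        raise ValueError("df must contain at least one row")
    # Assumption: the bounds are exclusive, i.e. 0 < confidence_level < 1.
    if not 0 < confidence_level < 1:
        raise ValueError("confidence_level must be between 0 and 1")
    if top_n < 1:
        raise ValueError("top_n must be at least 1")


# A list is not a DataFrame, so this reports a TypeError.
try:
    validate_inputs([1, 2, 3])
except TypeError as exc:
    print(f"TypeError: {exc}")
```

Validating up front keeps the failure close to the caller's mistake instead of surfacing later as an obscure pandas or scipy error.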

Examples

>>> import pandas as pd
>>> from csvplus.generate_report import summary_report
>>>
>>> df = pd.DataFrame({
...     'age': [25, 21, 32, None, 40],
...     'city': ['NYC', 'LA', 'NYC', 'SF', 'LA']
... })
>>> numeric_stats, categorical_stats = summary_report(df)
>>> numeric_stats.loc['age', 'mean']
29.5
>>> numeric_stats.loc['age', 'n_missing']
1
>>> numeric_stats.loc['age', 'missing_prop']
0.2
>>> round(numeric_stats.loc['age', 'ci_lower'], 2)
16.22
>>> categorical_stats.loc['city', 'n_unique']
3
>>> categorical_stats.loc['city', 'top_values']
{'NYC': 2, 'LA': 2, 'SF': 1}
>>> categorical_stats.loc['city', 'unique_prop']
0.6