data_version_diff

Summarize structural and statistical differences between two DataFrames.

Functions

Name Description
data_version_diff Compare an earlier and a later version of a pandas DataFrame and return a high-level summary of the changes.
display_data_version_diff Print a formatted, human-readable summary of DataFrame version differences.

data_version_diff

data_version_diff.data_version_diff(df_v1, df_v2)

This function compares an earlier and a later version of a pandas DataFrame and returns a high-level summary of how the data has changed. It is designed for data auditing, version tracking, and exploratory analysis rather than cell-by-cell comparison.

The comparison includes:

  • Columns that were added or removed
  • Changes in row counts
  • Changes in missing values by column
  • Changes in summary statistics for numeric columns
  • Changes in data types
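To make the listed checks concrete, here is a minimal sketch of how each one could be computed with plain pandas. It is an illustration under stated assumptions, not this library's implementation: the function name `sketch_version_diff`, the keys `missing`, `dtype_changes`, and `mean_change`, and the shape of every value are hypothetical, and only the mean is used as a stand-in for "summary statistics".

```python
import pandas as pd


def sketch_version_diff(df_v1: pd.DataFrame, df_v2: pd.DataFrame) -> dict:
    """Hypothetical sketch of a summary-level diff; not the library's code."""
    cols_v1, cols_v2 = set(df_v1.columns), set(df_v2.columns)
    shared = sorted(cols_v1 & cols_v2)

    # Missing-value counts per shared column, as (before, after) pairs.
    missing = {
        col: (int(df_v1[col].isna().sum()), int(df_v2[col].isna().sum()))
        for col in shared
    }

    # Shared columns whose dtype name differs between the two versions.
    dtype_changes = {
        col: (str(df_v1[col].dtype), str(df_v2[col].dtype))
        for col in shared
        if str(df_v1[col].dtype) != str(df_v2[col].dtype)
    }

    # Shift in the mean for columns that are numeric in both versions.
    numeric = [
        col
        for col in shared
        if pd.api.types.is_numeric_dtype(df_v1[col])
        and pd.api.types.is_numeric_dtype(df_v2[col])
    ]
    mean_change = {
        col: float(df_v2[col].mean() - df_v1[col].mean()) for col in numeric
    }

    return {
        "columns_added": sorted(cols_v2 - cols_v1),
        "columns_removed": sorted(cols_v1 - cols_v2),
        "row_count_change": len(df_v2) - len(df_v1),
        "missing": missing,              # hypothetical key
        "dtype_changes": dtype_changes,  # hypothetical key
        "mean_change": mean_change,      # hypothetical key
    }


# Illustrative data only.
df_old = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30], "status": [1, 0, 1]})
df_new = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value": ["10", None, "30", "40"],
    "category": ["A", "B", None, "C"],
})
summary = sketch_version_diff(df_old, df_new)
```

Everything here is computed per column or per frame, which is why a summary of this kind stays cheap compared with a cell-by-cell comparison.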

Parameters

Name Type Description Default
df_v1 pandas.DataFrame The original or earlier version of the dataset. required
df_v2 pandas.DataFrame The updated or later version of the dataset. required

Returns

Name Type Description
diff dict A dictionary summarizing differences between the two DataFrames.

Notes

  • This function assumes both inputs are pandas DataFrames.
  • Rows are compared by position only; no key-based row matching is performed.
  • The function is intended for small to medium-sized datasets and exploratory analysis rather than large-scale production pipelines.
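Because no key-based row matching is performed, callers who need to know which records appeared, disappeared, or changed have to do that step themselves. The sketch below shows one way to do it with `DataFrame.merge` and its `indicator` option; the key column `id` and the sample data are assumptions made for illustration, and none of this is part of `data_version_diff`.

```python
import pandas as pd

# Illustrative data; "id" is assumed to be a unique key in both versions.
df_v1 = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
df_v2 = pd.DataFrame({"id": [2, 3, 4], "value": [20, 35, 40]})

# Outer-join on the key; the `_merge` column records where each key was found.
merged = df_v1.merge(
    df_v2, on="id", how="outer", suffixes=("_v1", "_v2"), indicator=True
)

removed_ids = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
added_ids = merged.loc[merged["_merge"] == "right_only", "id"].tolist()

# Among keys present in both versions, find rows whose value differs.
both = merged[merged["_merge"] == "both"]
changed_ids = both.loc[both["value_v1"] != both["value_v2"], "id"].tolist()
```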

Examples

>>> import pandas as pd
>>> from csvplus.data_version_diff import data_version_diff
>>>
>>> # Original dataset
>>> df_v1 = pd.DataFrame({
...     "id": [1, 2, 3],
...     "value": [10, 20, 30],
...     "status": [1, 0, 1]
... })
>>>
>>> # Updated dataset
>>> df_v2 = pd.DataFrame({
...     "id": [1, 2, 3, 4],
...     "value": ["10", "25", "30", "40"],
...     "category": ["A", "B", None, "C"],
...     "amount": [100, 200, 300, 400]
... })
>>>
>>> # Compare the two DataFrames
>>> diff = data_version_diff(df_v1, df_v2)
>>>
>>> # Check which columns were added
>>> diff["columns_added"]
>>>
>>> # Check which columns were removed
>>> diff["columns_removed"]
>>>
>>> # Row count change
>>> diff["row_count_change"]
>>>
>>> # Missing value changes
>>> diff["missing_value_changes"]
>>>
>>> # Numeric summary changes
>>> diff["numeric_summary_changes"]

display_data_version_diff

data_version_diff.display_data_version_diff(diff)

Print a formatted, human-readable summary of DataFrame version differences.

This function takes the output of data_version_diff and prints a structured console report highlighting row count changes, schema changes, missing value differences, numeric summary changes, and data type changes.

Parameters

Name Type Description Default
diff dict The dictionary returned by data_version_diff. required

Notes

  • This function is intended for interactive use (e.g., notebooks or terminals).
  • It does not return any value.
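Since the report is printed rather than returned, saving it to a log or file means capturing standard output. The sketch below does this with the standard library's `contextlib.redirect_stdout`. The function `print_report` is a hypothetical stand-in for any print-only reporter such as `display_data_version_diff`, and the contents of the sample `diff` dictionary are assumed for illustration.

```python
import contextlib
import io


def print_report(diff: dict) -> None:
    """Hypothetical stand-in for a print-only reporting function."""
    for key, value in diff.items():
        print(f"{key}: {value}")


# Assumed shape of a diff dictionary, for illustration only.
diff = {"columns_added": ["amount"], "row_count_change": 1}

# Redirect everything printed inside the block into an in-memory buffer.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    result = print_report(diff)

report_text = buffer.getvalue()  # the report as a plain string
```

The same pattern should apply to any function that writes through `print`; output written directly to the underlying file descriptor would not be captured.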

Examples

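>>> import pandas as pd
>>> from csvplus.data_version_diff import data_version_diff, display_data_version_diff
>>>
>>> # Two small versions of the same dataset
>>> df_v1 = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
>>> df_v2 = pd.DataFrame({"id": [1, 2, 3, 4], "value": [10, 25, 30, 40]})
>>>
>>> # Compute the summary, then print the report
>>> diff = data_version_diff(df_v1, df_v2)
>>> display_data_version_diff(diff)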