data_version_diff
Summarize structural and statistical differences between two DataFrames.
Functions
| Name | Description |
|---|---|
| data_version_diff | This function compares an earlier and a later version of a pandas DataFrame and returns a high-level summary of how the data has changed. |
| display_data_version_diff | Print a formatted, human-readable summary of DataFrame version differences. |
data_version_diff
data_version_diff.data_version_diff(df_v1, df_v2)
This function compares an earlier and a later version of a pandas DataFrame and returns a high-level summary of how the data has changed. It is designed for data auditing, version tracking, and exploratory analysis rather than cell-by-cell comparison.
The comparison includes:
- Columns that were added or removed
- Changes in row counts
- Changes in missing values by column
- Changes in summary statistics for numeric columns
- Changes in data types
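To make the kinds of comparison listed above concrete, here is a minimal sketch of how such a summary could be assembled with plain pandas. It is not the csvplus implementation: `sketch_version_diff` is a hypothetical helper, the first four keys mirror those used in the Examples section, and the `dtype_changes` and `mean_shift` keys, along with the shape of every value, are assumptions made for illustration.

```python
import pandas as pd


def sketch_version_diff(df_v1: pd.DataFrame, df_v2: pd.DataFrame) -> dict:
    """Hypothetical helper; NOT the csvplus implementation."""
    cols_v1, cols_v2 = set(df_v1.columns), set(df_v2.columns)
    # Columns present in both versions, in df_v1 order
    shared = [c for c in df_v1.columns if c in cols_v2]

    # Change in missing-value count (v2 minus v1) for each shared column
    missing = {
        c: int(df_v2[c].isna().sum()) - int(df_v1[c].isna().sum())
        for c in shared
    }

    # Shared columns whose dtype name differs between versions
    dtype_changes = {
        c: (str(df_v1[c].dtype), str(df_v2[c].dtype))
        for c in shared
        if str(df_v1[c].dtype) != str(df_v2[c].dtype)
    }

    # Mean shift, only for columns that are numeric in both versions
    numeric = [
        c for c in shared
        if pd.api.types.is_numeric_dtype(df_v1[c])
        and pd.api.types.is_numeric_dtype(df_v2[c])
    ]
    mean_shift = {c: float(df_v2[c].mean() - df_v1[c].mean()) for c in numeric}

    return {
        "columns_added": sorted(cols_v2 - cols_v1),
        "columns_removed": sorted(cols_v1 - cols_v2),
        "row_count_change": len(df_v2) - len(df_v1),
        "missing_value_changes": missing,
        "dtype_changes": dtype_changes,  # key name is an assumption
        "mean_shift": mean_shift,        # key name is an assumption
    }
```

A column that changes from numeric to string between versions appears under `dtype_changes` in this sketch and is left out of `mean_shift`, because a mean is only meaningful when both versions are numeric.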
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| df_v1 | pandas.DataFrame | The original or earlier version of the dataset. | required |
| df_v2 | pandas.DataFrame | The updated or later version of the dataset. | required |
Returns
| Name | Type | Description |
|---|---|---|
| diff | dict | A dictionary summarizing differences between the two DataFrames. |
Notes
- This function assumes both inputs are pandas DataFrames.
- Rows are compared by position only; no key-based row matching is performed.
- The function is intended for small to medium-sized datasets and exploratory analysis rather than large-scale production pipelines.
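Because no key-based row matching is performed, identifying which records were added or removed has to be done separately. The following is a minimal sketch of one way to do that with `DataFrame.merge` and its `indicator` option. `rows_by_key` is a hypothetical helper, not part of csvplus, and it assumes the key column holds unique values in each DataFrame.

```python
import pandas as pd


def rows_by_key(df_v1: pd.DataFrame, df_v2: pd.DataFrame, key: str) -> dict:
    """Hypothetical helper: classify key values as removed, added, or shared.

    Assumes `key` is unique within each DataFrame; duplicate keys would
    produce repeated entries after the merge.
    """
    # indicator=True adds a "_merge" column: left_only, right_only, or both
    merged = df_v1[[key]].merge(df_v2[[key]], on=key, how="outer", indicator=True)
    return {
        "removed": merged.loc[merged["_merge"] == "left_only", key].tolist(),
        "added": merged.loc[merged["_merge"] == "right_only", key].tolist(),
        "shared": merged.loc[merged["_merge"] == "both", key].tolist(),
    }


old = pd.DataFrame({"id": [1, 2, 3]})
new = pd.DataFrame({"id": [2, 3, 4]})
result = rows_by_key(old, new, "id")
```

Here `result["removed"]` holds the keys found only in `old`, and `result["added"]` holds the keys found only in `new`.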
Examples
>>> import pandas as pd
>>> from csvplus.data_version_diff import data_version_diff
>>>
>>> # Original dataset
>>> df_v1 = pd.DataFrame({
... "id": [1, 2, 3],
... "value": [10, 20, 30],
... "status": [1, 0, 1]
... })
>>>
>>> # Updated dataset
>>> df_v2 = pd.DataFrame({
... "id": [1, 2, 3, 4],
... "value": ["10", "25", "30", "40"],
... "category": ["A", "B", None, "C"],
... "amount": [100, 200, 300, 400]
... })
>>>
>>> # Compare the two DataFrames
>>> diff = data_version_diff(df_v1, df_v2)
>>>
>>> # Check which columns were added
>>> diff["columns_added"]
>>>
>>> # Check which columns were removed
>>> diff["columns_removed"]
>>>
>>> # Row count change
>>> diff["row_count_change"]
>>>
>>> # Missing value changes
>>> diff["missing_value_changes"]
>>>
>>> # Numeric summary changes
>>> diff["numeric_summary_changes"]
display_data_version_diff
data_version_diff.display_data_version_diff(diff)
Print a formatted, human-readable summary of DataFrame version differences.
This function takes the output of data_version_diff and prints a structured console report highlighting row count changes, schema changes, missing value differences, numeric summary changes, and data type changes.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| diff | dict | The dictionary returned by data_version_diff. | required |
Notes
- This function is intended for interactive use (e.g., notebooks or terminals).
- It does not return any value.
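Since the report is printed rather than returned, saving it to a file or a log means capturing standard output. A minimal sketch using only the standard library follows. `capture_printed` is a hypothetical helper, and the sketch assumes the report is written through `print` or `sys.stdout`; a plain `print` call stands in for `display_data_version_diff` so the block runs on its own.

```python
import io
from contextlib import redirect_stdout


def capture_printed(func, *args, **kwargs) -> str:
    """Hypothetical helper: call func and return whatever it printed."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


# Stand-in for display_data_version_diff(diff)
report = capture_printed(print, "Row count change: +1")
```

Under the same assumption, `capture_printed(display_data_version_diff, diff)` would return the full report as a string.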
Examples
>>> import pandas as pd
>>> from csvplus.data_version_diff import data_version_diff, display_data_version_diff
>>>
>>> # df_v1 and df_v2 are the DataFrames built in the data_version_diff example
>>> diff = data_version_diff(df_v1, df_v2)
>>> display_data_version_diff(diff)