csvplus
Overview
In real-world data work, CSV files are rarely clean, stable, or small. They often contain inconsistent values, evolve over time, and can be large enough to cause performance issues when loaded naively.
This tutorial demonstrates a complete CSV auditing workflow using the csvplus package. By the end, you will know how to:
- Load large CSV files efficiently
- Resolve inconsistent string values
- Generate concise data summaries
- Compare multiple versions of a dataset
- Produce human-readable change reports
The goal is not just to show individual functions, but to illustrate how they work together in a realistic data science workflow.
Scenario: Auditing a Partner Dataset
Suppose you receive a CSV file from a partner organization containing company-level data. The file:
- Is relatively large
- Contains inconsistent company name spellings
- Has been updated since the last delivery
You want to load the data safely, clean it, understand its structure, and identify what has changed between versions.
Step 1: Efficiently Loading a Large CSV
Rather than using pandas.read_csv() directly, csvplus provides load_optimized_csv() to reduce memory usage automatically.
from csvplus.load_optimized_csv import load_optimized_csv

# Load a large CSV dataset
df_v1 = load_optimized_csv(
    "large_dataset.csv",
    usecols=["company", "revenue", "employees", "region"],
    no_category_cols=["company"]
)

# Inspect memory usage
df_v1.info(memory_usage="deep")
What this function does
- Downcasts numeric columns where possible
- Converts low-cardinality string columns to category
- Uses sparse data types for columns with many missing values
- Reduces memory usage without changing the data semantics
This is especially useful when working with large or repeated CSV ingestions.
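If you want to see the effect on your own data, one quick check is to load the same file with plain pandas.read_csv() and compare deep memory usage. The snippet below is only a sketch; it reuses the file name and column list from the example above.

import pandas as pd
from csvplus.load_optimized_csv import load_optimized_csv

cols = ["company", "revenue", "employees", "region"]

# Plain pandas load versus the optimized load of the same file
df_plain = pd.read_csv("large_dataset.csv", usecols=cols)
df_opt = load_optimized_csv("large_dataset.csv", usecols=cols, no_category_cols=["company"])

plain_mb = df_plain.memory_usage(deep=True).sum() / 1e6
opt_mb = df_opt.memory_usage(deep=True).sum() / 1e6
print(f"read_csv:           {plain_mb:.1f} MB")
print(f"load_optimized_csv: {opt_mb:.1f} MB")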
Step 2: Resolving Inconsistent String Values
Text fields such as company names are often inconsistent due to typos or formatting differences. The resolve_string_value() function uses fuzzy matching to standardize values.
import pandas as pd
from csvplus.data_correction import resolve_string_value
data = pd.DataFrame({
    "company": ["Google", "Google Inc.", "Gogle", "Microsoftt", "Micro-soft"],
    "num_searches": [1, 2, 3, 4, 5]
})

resolve_string_value(
    data,
    column_name="company",
    resolved_names=["Google", "Microsoft", "Amazon"],
    threshold=80
)
print(data)

     company  num_searches
0     Google             1
1     Google             2
2     Google             3
3  Microsoft             4
4  Microsoft             5
After running this function:
- Similar spellings are mapped to a canonical value
- Changes are applied in-place to the DataFrame
- Downstream analysis becomes more reliable
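A quick way to confirm the cleanup worked (relying on the in-place behavior shown above) is to count the distinct values left in the column:

# Only the canonical names should remain after resolution
print(data["company"].value_counts())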
Step 3: Generating a Data Summary Report
Before performing deeper analysis, it is often helpful to understand the structure and quality of the dataset.
import pandas as pd
from csvplus.generate_report import summary_report
df = pd.DataFrame({
    'age': [25, 21, 32, None, 40],
    'city': ['NYC', 'LA', 'NYC', 'SF', 'LA']
})
numeric_stats, categorical_stats = summary_report(df)
numeric_stats.loc['age', 'mean']          # mean of the 'age' column
numeric_stats.loc['age', 'n_missing']     # number of missing values
numeric_stats.loc['age', 'missing_prop']  # proportion of missing values
numeric_stats.loc['age', 'ci_lower']      # lower bound of the confidence interval
numeric_stats.head()
categorical_stats.head()

|      | count | n_missing | missing_prop | n_unique | unique_prop | is_constant | top_values | top_1_prop |
|---|---|---|---|---|---|---|---|---|
| city | 5 | 0 | 0.0 | 3 | 0.6 | False | {'NYC': 2, 'LA': 2, 'SF': 1} | 0.4 |
The numeric summary includes:
- Missing value proportions
- Distribution statistics
- Confidence intervals
To inspect categorical columns:
categorical_stats.loc['city', 'n_unique']
categorical_stats.loc['city', 'top_values']
categorical_stats.loc['city', 'unique_prop']
np.float64(0.6)
This makes it easy to identify high-cardinality columns, constant values, and potential data quality issues.
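These summaries can also drive simple automated checks. The sketch below flags columns that may need attention; the column names come from the summary tables above, while the 0.9 and 0.5 thresholds are arbitrary choices for illustration, not package defaults.

# Flag potentially problematic columns using the summary tables
high_cardinality = categorical_stats[categorical_stats["unique_prop"] > 0.9].index.tolist()
constant_cols = categorical_stats[categorical_stats["is_constant"]].index.tolist()
mostly_missing = numeric_stats[numeric_stats["missing_prop"] > 0.5].index.tolist()

print("High-cardinality columns:", high_cardinality)
print("Constant columns:", constant_cols)
print("Mostly-missing numeric columns:", mostly_missing)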
Step 4: Comparing Dataset Versions
A week later, you receive an updated version of the original CSV file, with potential schema and data changes. At this point, you want to understand what changed compared to the original dataset. After loading both versions, the data_version_diff() function computes a structured comparison between the two DataFrames.
import pandas as pd
from csvplus.data_version_diff import data_version_diff
# Original dataset
df_v1 = pd.DataFrame({
    "id": [1, 2, 3],
    "value": [10, 20, 30],
    "status": [1, 0, 1]
})

# Updated dataset
df_v2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value": ["10", "25", "30", "40"],  # dtype change: int → str
    "category": ["A", "B", None, "C"],
    "amount": [100, 200, 300, 400]
})
# Compute differences
diff = data_version_diff(df_v1, df_v2)
diff
{'columns_added': ['amount', 'category'],
'columns_removed': ['status'],
'row_count_change': {'old_row_count': 3,
'new_row_count': 4,
'row_difference': 1},
'missing_value_changes': column missing_old missing_new difference
0 id 0 0 0
1 value 0 0 0,
'numeric_summary_changes': column statistic old new difference
0 id mean 2.0 2.500000 0.500000
1 id std 1.0 1.290994 0.290994
2 id min 1.0 1.000000 0.000000
3 id max 3.0 4.000000 1.000000,
'dtype_changes': column old_dtype new_type
0 value int64 object,
'summary': {'n_columns_added': 2,
'n_columns_removed': 1,
'n_dtype_changes': 1,
'n_missing_changes': 0}}
The resulting object contains detailed information about:
- Row count changes
- Columns added or removed
- Data type changes
- Missing value differences
- Numeric summary shifts
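In the partner-data scenario, the same comparison would typically run on two delivered files rather than hand-built DataFrames. The snippet below is a sketch; the two file names are hypothetical stand-ins for the original and updated deliveries.

from csvplus.load_optimized_csv import load_optimized_csv
from csvplus.data_version_diff import data_version_diff

# Hypothetical file names for the two deliveries
df_old = load_optimized_csv("partner_data_v1.csv")
df_new = load_optimized_csv("partner_data_v2.csv")
file_diff = data_version_diff(df_old, df_new)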
Step 5: Inspecting Dataframe Changes Programmatically
You can explore specific components of the diff object directly:
diff["columns_added"]['amount', 'category']
diff["row_count_change"]{'old_row_count': 3, 'new_row_count': 4, 'row_difference': 1}
diff["missing_value_changes"]| column | missing_old | missing_new | difference | |
|---|---|---|---|---|
| 0 | id | 0 | 0 | 0 |
| 1 | value | 0 | 0 | 0 |
diff["numeric_summary_changes"]| column | statistic | old | new | difference | |
|---|---|---|---|---|---|
| 0 | id | mean | 2.0 | 2.500000 | 0.500000 |
| 1 | id | std | 1.0 | 1.290994 | 0.290994 |
| 2 | id | min | 1.0 | 1.000000 | 0.000000 |
| 3 | id | max | 3.0 | 4.000000 | 1.000000 |
This is useful for building automated QA checks or data validation pipelines.
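As a concrete illustration, a minimal QA gate built on the diff keys shown above might reject an update that silently drops columns or changes dtypes. This is a sketch of one possible policy, not part of csvplus itself.

def check_version_diff(diff):
    """Raise if a new dataset version removes columns or changes dtypes."""
    problems = []
    if diff["columns_removed"]:
        problems.append(f"columns removed: {diff['columns_removed']}")
    if diff["summary"]["n_dtype_changes"] > 0:
        problems.append(f"{diff['summary']['n_dtype_changes']} dtype change(s)")
    if problems:
        raise ValueError("Dataset update rejected: " + "; ".join(problems))

check_version_diff(diff)  # raises for the example diff above ('status' removed, 'value' dtype changed)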
Step 6: Displaying a Human-Readable Report
For interactive use, csvplus can render the diff object from Step 4 as a clean, console-friendly summary.
from csvplus.data_version_diff import display_data_version_diff
display_data_version_diff(diff)
============================================================
DATA VERSION CHANGE SUMMARY
============================================================
ROWS CHANGE:
------------------------------------------------------------
Old Rows: 3
New Rows: 4
Change: +1
SCHEMA CHANGES:
------------------------------------------------------------
Columns added: amount, category
Columns removed: status
MISSING VALUE CHANGES:
------------------------------------------------------------
No changes in missing values.
NUMERIC SUMMARY CHANGES:
------------------------------------------------------------
column statistic old new difference
id mean 2.0 2.50 0.50
id std 1.0 1.29 0.29
id max 3.0 4.00 1.00
DATA TYPE CHANGES:
------------------------------------------------------------
column old_dtype new_type
value int64 object
This produces a readable overview of:
- Structural changes
- Data quality differences
- Summary statistic shifts
Ideal for notebooks, terminals, and exploratory analysis.
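If you also want to keep the report as an artifact of an audit run, one option (assuming the function prints to standard output, as the example above suggests) is to capture the output into a text file:

from contextlib import redirect_stdout

# Capture the printed report in a text file
with open("version_diff_report.txt", "w") as f:
    with redirect_stdout(f):
        display_data_version_diff(diff)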
When to Use csvplus
Use csvplus when you need to:
- Load large CSV files safely and efficiently
- Standardize inconsistent string values
- Quickly summarize unfamiliar datasets
- Track changes between dataset versions
- Perform lightweight data audits
When NOT to Use csvplus
While csvplus is designed to make working with CSV files easier and safer, it is not always the right tool. You may want to avoid using this package in the following situations:
- You are working with very large datasets (GBs or TBs) that require distributed processing. In those cases, tools like Spark, DuckDB, or database-native ingestion pipelines will be more appropriate.
- Your data is not tabular (e.g., nested JSON, logs, images, or free text). csvplus is intentionally scoped to CSV-like, row–column data.
- You need real-time or streaming ingestion. This package is aimed at batch-style analysis and auditing, not streaming pipelines.
- You already have a well-tested, stable CSV pipeline that meets your needs. csvplus is most valuable when dealing with messy, unfamiliar, or externally sourced CSV files.
- You need advanced statistical modeling or machine learning. csvplus focuses on inspection, validation, and comparison; it pairs well with pandas, scikit-learn, or statsmodels rather than replacing them.
Summary
This tutorial demonstrated how csvplus supports an end-to-end CSV auditing workflow. Rather than addressing isolated tasks, the package enables you to move confidently from raw data ingestion to version comparison and reporting.
By combining efficient loading, cleaning utilities, summary statistics, and version diffs, csvplus helps make messy CSV data more manageable and transparent.