csvplus

Overview

In real-world data work, CSV files are rarely clean, stable, or small. They often contain inconsistent values, evolve over time, and can be large enough to cause performance issues when loaded naively.

This tutorial demonstrates a complete CSV auditing workflow using the csvplus package. By the end, you will know how to:

  • Load large CSV files efficiently
  • Resolve inconsistent string values
  • Generate concise data summaries
  • Compare multiple versions of a dataset
  • Produce human-readable change reports

The goal is not just to show individual functions, but to illustrate how they work together in a realistic data science workflow.

Scenario: Auditing a Partner Dataset

Suppose you receive a CSV file from a partner organization containing company-level data. The file:

  • Is relatively large
  • Contains inconsistent company name spellings
  • Has been updated since the last delivery

You want to load the data safely, clean it, understand its structure, and identify what has changed between versions.

Step 1: Efficiently Loading a Large CSV

Rather than using pandas.read_csv() directly, csvplus provides load_optimized_csv() to reduce memory usage automatically.

from csvplus.load_optimized_csv import load_optimized_csv

# Load a large CSV dataset
df_v1 = load_optimized_csv(
    "large_dataset.csv",
    usecols=["company", "revenue", "employees", "region"],
    no_category_cols=["company"]
)

# Inspect memory usage
df_v1.info(memory_usage="deep")

What this function does

  • Downcasts numeric columns where possible
  • Converts low-cardinality string columns to category
  • Uses sparse data types for columns with many missing values
  • Reduces memory usage without changing the data semantics

This is especially useful when working with large or repeated CSV ingestions.
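
To see the effect on your own data, you can compare against a plain pandas load of the same file. The sketch below assumes the large_dataset.csv file and the df_v1 DataFrame from the example above.

# Compare memory usage against a plain pandas load of the same file
# (assumes "large_dataset.csv" and df_v1 from the example above)
import pandas as pd

baseline = pd.read_csv(
    "large_dataset.csv",
    usecols=["company", "revenue", "employees", "region"]
)

baseline_mb = baseline.memory_usage(deep=True).sum() / 1e6
optimized_mb = df_v1.memory_usage(deep=True).sum() / 1e6
print(f"pandas.read_csv:    {baseline_mb:.1f} MB")
print(f"load_optimized_csv: {optimized_mb:.1f} MB")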

Step 2: Resolving Inconsistent String Values

Text fields such as company names are often inconsistent due to typos or formatting differences. The resolve_string_value() function uses fuzzy matching to standardize values.

import pandas as pd
from csvplus.data_correction import resolve_string_value

data = pd.DataFrame({
    "company": ["Google", "Google Inc.", "Gogle", "Microsoftt", "Micro-soft"],
    "num_searches": [1, 2, 3, 4, 5]
})
resolve_string_value(
    data,
    column_name="company",
    resolved_names=["Google", "Microsoft", "Amazon"],
    threshold=80
)
print(data)
     company  num_searches
0     Google             1
1     Google             2
2     Google             3
3  Microsoft             4
4  Microsoft             5

After running this function:

  • Similar spellings are mapped to a canonical value
  • Changes are applied in-place to the DataFrame
  • Downstream analysis becomes more reliable
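
If you want to audit exactly which values were rewritten, keep a copy of the raw column before resolving and compare it afterwards. The sketch below rebuilds the example data and uses only plain pandas alongside the call shown above.

# Rebuild the example data and record what resolve_string_value() changed
# (plain pandas; same arguments as the example above)
data = pd.DataFrame({
    "company": ["Google", "Google Inc.", "Gogle", "Microsoftt", "Micro-soft"],
    "num_searches": [1, 2, 3, 4, 5]
})
raw = data["company"].copy()
resolve_string_value(
    data,
    column_name="company",
    resolved_names=["Google", "Microsoft", "Amazon"],
    threshold=80
)
changed = raw != data["company"]
print(pd.DataFrame({"before": raw[changed], "after": data.loc[changed, "company"]}))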

Step 3: Generating a Data Summary Report

Before performing deeper analysis, it is often helpful to understand the structure and quality of the dataset.

import pandas as pd
from csvplus.generate_report import summary_report

df = pd.DataFrame({
    'age': [25, 21, 32, None, 40],
    'city': ['NYC', 'LA', 'NYC', 'SF', 'LA']
})
numeric_stats, categorical_stats = summary_report(df)

numeric_stats.loc['age', 'mean']
numeric_stats.loc['age', 'n_missing']
numeric_stats.loc['age', 'missing_prop']
numeric_stats.loc['age', 'ci_lower']

numeric_stats.head()
categorical_stats.head()
      count  n_missing  missing_prop  n_unique  unique_prop  is_constant                    top_values  top_1_prop
city      5          0           0.0         3          0.6        False  {'NYC': 2, 'LA': 2, 'SF': 1}         0.4

The numeric summary includes:

  • Missing value proportions
  • Distribution statistics
  • Confidence intervals
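
For example, you can flag columns whose share of missing values exceeds a threshold. This is a small sketch using the missing_prop column shown in the lookups above; the 10% cutoff is illustrative.

# Flag numeric columns with a high share of missing values
# (uses the missing_prop column from the numeric report; the 0.1 cutoff is illustrative)
high_missing = numeric_stats[numeric_stats["missing_prop"] > 0.1].index.tolist()
print("Columns with >10% missing values:", high_missing)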

To inspect categorical columns:

categorical_stats.loc['city', 'n_unique']
categorical_stats.loc['city', 'top_values']
categorical_stats.loc['city', 'unique_prop']
np.float64(0.6)

This makes it easy to identify high-cardinality columns, constant values, and potential data quality issues.
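
A similar screen works for the categorical report. The sketch below assumes the is_constant and unique_prop columns shown in the output above; the 0.9 cutoff is illustrative.

# Screen the categorical summary for constant and high-cardinality columns
# (column names from the report output above; the 0.9 cutoff is illustrative)
constant_cols = categorical_stats[categorical_stats["is_constant"]].index.tolist()
high_cardinality = categorical_stats[categorical_stats["unique_prop"] > 0.9].index.tolist()

print("Constant columns:", constant_cols)
print("High-cardinality columns:", high_cardinality)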

Step 4: Comparing Dataset Versions

A week later, you receive an updated version of the original CSV file, with potential schema and data changes. At this point, you want to understand what changed relative to the original dataset. After loading the new file, the data_version_diff() function computes a structured comparison between the two DataFrames.

import pandas as pd
from csvplus.data_version_diff import data_version_diff

# Original dataset
df_v1 = pd.DataFrame({
    "id": [1, 2, 3],
    "value": [10, 20, 30],
    "status": [1, 0, 1]
})

# Updated dataset
df_v2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value": ["10", "25", "30", "40"],  # dtype change: int → str
    "category": ["A", "B", None, "C"],
    "amount": [100, 200, 300, 400]
})

# Compute differences
diff = data_version_diff(df_v1, df_v2)
diff
{'columns_added': ['amount', 'category'],
 'columns_removed': ['status'],
 'row_count_change': {'old_row_count': 3,
  'new_row_count': 4,
  'row_difference': 1},
 'missing_value_changes':   column  missing_old  missing_new  difference
 0     id            0            0           0
 1  value            0            0           0,
 'numeric_summary_changes':   column statistic  old       new  difference
 0     id      mean  2.0  2.500000    0.500000
 1     id       std  1.0  1.290994    0.290994
 2     id       min  1.0  1.000000    0.000000
 3     id       max  3.0  4.000000    1.000000,
 'dtype_changes':   column old_dtype new_type
 0  value     int64   object,
 'summary': {'n_columns_added': 2,
  'n_columns_removed': 1,
  'n_dtype_changes': 1,
  'n_missing_changes': 0}}

The resulting object contains detailed information about:

  • Row count changes
  • Columns added or removed
  • Data type changes
  • Missing value differences
  • Numeric summary shifts

Step 5: Inspecting DataFrame Changes Programmatically

You can explore specific components of the diff object directly:

diff["columns_added"]
['amount', 'category']
diff["row_count_change"]
{'old_row_count': 3, 'new_row_count': 4, 'row_difference': 1}
diff["missing_value_changes"]
  column  missing_old  missing_new  difference
0     id            0            0           0
1  value            0            0           0
diff["numeric_summary_changes"]
  column statistic  old       new  difference
0     id      mean  2.0  2.500000    0.500000
1     id       std  1.0  1.290994    0.290994
2     id       min  1.0  1.000000    0.000000
3     id       max  3.0  4.000000    1.000000

This is useful for building automated QA checks or data validation pipelines.
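
As a sketch of such a check, the snippet below fails when a new delivery drops columns or changes dtypes. It relies only on the diff keys shown above; the policy itself (no removed columns, no dtype changes) is an example, not part of csvplus.

# Illustrative QA gate built on the diff keys shown above
def assert_schema_stable(diff):
    if diff["columns_removed"]:
        raise ValueError(f"Columns removed: {diff['columns_removed']}")
    if len(diff["dtype_changes"]) > 0:
        raise ValueError(
            "Data types changed:\n" + diff["dtype_changes"].to_string(index=False)
        )

assert_schema_stable(diff)  # raises for the v1/v2 example above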

Step 6: Displaying a Human-Readable Report

For interactive use, csvplus provides display_data_version_diff(), which prints a clean, console-friendly summary of the diff object computed in Step 4.

from csvplus.data_version_diff import display_data_version_diff

display_data_version_diff(diff)

============================================================
DATA VERSION CHANGE SUMMARY
============================================================

  ROWS CHANGE:
------------------------------------------------------------
    Old Rows: 3
    New Rows: 4
    Change: +1

   SCHEMA CHANGES:
------------------------------------------------------------
    Columns added: amount, category
    Columns removed: status

   MISSING VALUE CHANGES:
------------------------------------------------------------
 No changes in missing values.

   NUMERIC SUMMARY CHANGES:
------------------------------------------------------------
column statistic  old  new  difference
    id      mean  2.0 2.50        0.50
    id       std  1.0 1.29        0.29
    id       max  3.0 4.00        1.00

   DATA TYPE CHANGES:
------------------------------------------------------------
column old_dtype new_type
 value     int64   object

This produces a readable overview of:

  • Structural changes
  • Data quality differences
  • Summary statistic shifts

Ideal for notebooks, terminals, and exploratory analysis.
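
If you want to archive the report alongside a data delivery, you can capture the console output to a file. The sketch below assumes display_data_version_diff() writes to stdout, as the printed output above suggests.

# Capture the printed report to a text file
# (assumes the function prints to stdout, as shown above)
import contextlib
import io

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    display_data_version_diff(diff)

with open("version_diff_report.txt", "w") as f:
    f.write(buffer.getvalue())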

When to Use csvplus

Use csvplus when you need to:

  • Load large CSV files safely and efficiently
  • Standardize inconsistent string values
  • Quickly summarize unfamiliar datasets
  • Track changes between dataset versions
  • Perform lightweight data audits

When NOT to Use csvplus

While csvplus is designed to make working with CSV files easier and safer, it is not always the right tool. You may want to avoid using this package in the following situations:

  • You are working with very large datasets (GBs or TBs) that require distributed processing. In those cases, tools like Spark, DuckDB, or database-native ingestion pipelines will be more appropriate.

  • Your data is not tabular (e.g., nested JSON, logs, images, or free text). csvplus is intentionally scoped to CSV-like, row–column data.

  • You need real-time or streaming ingestion. This package is aimed at batch-style analysis and auditing, not streaming pipelines.

  • You already have a well-tested, stable CSV pipeline that meets your needs. csvplus is most valuable when dealing with messy, unfamiliar, or externally sourced CSV files.

  • You need advanced statistical modeling or machine learning. csvplus focuses on inspection, validation, and comparison — it pairs well with pandas, scikit-learn, or statsmodels rather than replacing them.

Summary

This tutorial demonstrated how csvplus supports an end-to-end CSV auditing workflow. Rather than addressing isolated tasks, the package enables you to move confidently from raw data ingestion to version comparison and reporting.

By combining efficient loading, cleaning utilities, summary statistics, and version diffs, csvplus helps make messy CSV data more manageable and transparent.