ez-df-data-validator

ez-df-data-validator

Platform Badge
Package TestPyPI
CI/CD CI
Coverage codecov

Summary

2025-26 DSCI-524 Group 23 - Data Cleaner & Validator for Pandas DataFrames

ez-df-data-validator is a project that provides basic, but essential data cleaning functionality for ML workflows. This package provides a lightweight and user friendly toolkit for common data cleaning tasks in Python. It is designed to streamline data preprocessing by offering clear, reusable functions for detecting duplicates, standardizing column names, and handling missing values. The goal is to reduce repetitive code and make data preparation more efficient and reproducible.

Installation

Install for regular use:

Create a new folder using mkdir test_ez and run the below command:

pip install -i https://test.pypi.org/simple/ ez-df-data-validator

Requirements

  • Python 3.10+

Example usage

Within the test_ez folder, create a test file using touch test_package.py. Copy the below contents into this newly created file.

import pandas as pd
import numpy as np
from ez_df_data_validator import (
    standardize_schema, 
    missing_summary, 
    handle_missing,
    find_duplicates
)

# Create a messy dataset
df = pd.DataFrame({
    "Age ": [25, 25, 30, np.nan],
    "Income($)": [50000, 50000, 60000, 60000],
    "City": ["Van", "Van", "Tor", "Tor"]
})

# Clean headers
df = standardize_schema(df)

# Check for duplicates
duplicates = find_duplicates(df)
print(f"Found {len(duplicates)} duplicate rows")

# Summarize missing values
print(missing_summary(df))

# Handle missing values
df_clean = handle_missing(df, strategy="drop")

Run the script with python test_package.py. It should show an output similar to:

$ python test_package.py
Found 1 duplicate rows
        missing_count  missing_pct
column                            
age                 1         0.25
income              0         0.00
city                0         0.00

Functions

The package provides the following core data validation and cleaning utilities:

Function Description
standardize_schema() Standardize DataFrame column headers, remove duplicate columns and drop constant columns.
find_duplicates() Identifies duplicate rows in a dataset based on one or more specified columns, helping users quickly detect and inspect redundant data.
handle_missing() Handles missing data in input Pandas dataframe so as to speed up the data science pipeline.
missing_summary() Summarizes missing values per column (count and proportion) to help assess data completeness.

Developer Guide

Follow these steps to set up the development environment and contribute to the project.

We use conda to manage dependencies.

# Create and activate environment
conda env create -f environment.yml
conda activate ez_df_data_validator

# Install package with development + testing + docs tools
pip install -e ".[tests,dev,docs]"

# Run tests
pytest
pytest --cov=ez_df_data_validator --cov-report=term-missing --cov-branch

# Build documentation locally
quartodoc build
quarto preview

Continuous Integration

This project uses GitHub Actions for automated testing and code quality checks.
CI workflow includes:

  • Python 3.12 environment
  • Editable package installation with dev/test dependencies
  • Pytest with coverage reporting
  • Ruff linting

Workflows run on pushes and pull requests to main.

Documentation

Project documentation is automatically generated using quartodoc and deployed with GitHub Pages as part of the CI/CD workflow.

Position of this package in the Python Ecosystem

This package is intended to complement existing data science libraries rather than replace them. Core functionality overlaps with well established tools such as pandas and NumPy, which provide operations for data manipulation and cleaning. However, this package focuses on wrapping common data cleaning patterns into simple functions that are easy to use. Similar preprocessing utilities also exist in scikit-learn.

Contributors

  • Gaurang Ahuja
  • Nishanth Kumarasamy
  • Johnson Leung
  • Siting Wang