canadataClean
A Python package providing a collection of utility functions to clean and validate Canada-specific data.
Welcome to canadataClean Documentation
canadataClean provides a collection of utility functions for cleaning and validating Canada-specific structured data in pandas DataFrames. The package is designed to help users efficiently standardize common Canadian data fields while identifying invalid or problematic entries.
Summary
This package helps ensure data consistency for Canadian information by formatting and validating phone numbers, postal codes, and city or province names, and by checking dates or dates of birth against Canadian date formats, highlighting any invalid entries.
When a value does not meet the required Canadian format, canadataClean raises a warning-type error to flag the invalid entry while allowing data processing to continue. This makes it easy to identify and address data quality issues without interrupting workflows, while still producing clean, analysis-ready datasets.
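The flag-and-continue behaviour described above can be illustrated with Python's standard `warnings` module. This is only a minimal sketch of the general pattern: the helper `flag_invalid`, the toy check `looks_like_year`, and the choice to return `None` for invalid entries are all hypothetical and are not taken from canadataClean itself.

```python
import warnings


def flag_invalid(value, is_valid):
    """Return value if it passes is_valid; otherwise warn and return None.

    Hypothetical helper for illustration, not part of canadataClean.
    """
    if is_valid(value):
        return value
    warnings.warn(f"Invalid entry: {value!r}", UserWarning, stacklevel=2)
    return None


def looks_like_year(text):
    """Toy validity check used only in this sketch."""
    return text.isdigit() and len(text) == 4


entries = ["1999", "20x4", "2024"]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeats
    cleaned = [flag_invalid(entry, looks_like_year) for entry in entries]

# The loop ran to completion; the invalid entry was flagged, not fatal.
print(cleaned)
print([str(w.message) for w in caught])
```

Because a warning does not stop execution, the remaining entries are still processed, and the recorded warnings can be reviewed afterwards.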
Installation
To reproduce the environment, run:
conda env create -f environment.yml
conda activate canadataClean
You can install this package from TestPyPI into your preferred Python environment using pip:
$ pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple canadataClean
To use canadataClean in your code:
from canadataClean import clean_date, clean_location, clean_phonenumber, clean_postalcode
Functions
- clean_date(date): This function cleans and validates a date string, converting common formats to the Canadian standard YYYY-MM-DD (ISO 8601).
- clean_postalcode(postal_code): This function cleans and validates a Canadian postal code string to ensure that it matches the Canadian postal code format. This is a six-character code defined and maintained by Canada Post Corporation (CPC) for the purpose of sorting and delivering mail. The characters are arranged in the form ‘ANA NAN’, where ‘A’ represents an alphabetic character and ‘N’ represents a numeric character (e.g., K1A 0T6). More information about Canadian postal codes can be found here.
- clean_location(location): This function cleans and validates a free-text entry representing a Canadian province or territory and returns the two-letter province or territory code, e.g. “BC” for “British Columbia”. The abbreviations for Canadian provinces and territories can be found here.
- clean_phonenumber(phone_number): This function cleans and validates a phone number string to ensure that it matches the Canadian phone number format (“+1 (XXX) XXX-XXXX”), which consists of the country code (+1) followed by a 10-digit number.
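To make the two target formats concrete, here is a minimal sketch of how a string could be normalized to ‘ANA NAN’ and to “+1 (XXX) XXX-XXXX”. The functions `sketch_clean_postalcode` and `sketch_clean_phonenumber` are hypothetical stand-ins, not the package's implementation: they return `None` for invalid input (an assumption), and the postal code pattern checks only the letter/digit arrangement described above, not whether the code is actually in use.

```python
import re

# Simplified 'ANA NAN' pattern: letter, digit, letter, an optional space
# or hyphen, then digit, letter, digit. Assumption: ASCII input only.
_POSTAL_RE = re.compile(r"^([A-Z][0-9][A-Z])[ -]?([0-9][A-Z][0-9])$")


def sketch_clean_postalcode(text):
    """Return the code as 'ANA NAN', or None if it does not fit the pattern."""
    match = _POSTAL_RE.match(text.strip().upper())
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


def sketch_clean_phonenumber(text):
    """Return '+1 (XXX) XXX-XXXX', or None if the digit count is wrong."""
    digits = re.sub(r"[^0-9]", "", text)  # keep ASCII digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    if len(digits) != 10:
        return None
    return f"+1 ({digits[:3]}) {digits[3:6]}-{digits[6:]}"


print(sketch_clean_postalcode("k1a0t6"))         # K1A 0T6
print(sketch_clean_phonenumber("604-555-0199"))  # +1 (604) 555-0199
```

Stripping everything except digits before formatting lets the phone sketch accept inputs with spaces, dots, hyphens, or parentheses without listing each variant explicitly.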
To run the tests
You can run the tests for this package using pytest. First, install the testing dependencies:
pip install -e ".[test]"
Then, run the tests with:
pytest
To view the test coverage, run the following command:
pytest --cov=src/canadata_clean
Documentation
The online documentation for this package can be found here.
To generate and preview the reference documentation, use the following command:
quartodoc build --watch
quarto preview
Usage
- Standardizing Dates: The clean_date function standardizes a string to the Canadian format YYYY-MM-DD (ISO 8601):
from canadataClean import clean_date
cleaned_date = clean_date("date")  # Replace date with the actual date
- Standardizing Postal Codes: The clean_postalcode function standardizes a string to the Canadian postal code format (e.g., “A1A 1A1”):
from canadataClean import clean_postalcode
cleaned_postalcode = clean_postalcode("postal_code")  # Replace postal_code with the actual postal code
- Standardizing Provinces and Territories: The clean_location function standardizes a string to the two-letter province or territory code (e.g. “BC” for “British Columbia”):
from canadataClean import clean_location
cleaned_location = clean_location("location")  # Replace location with the actual province or territory
- Standardizing Phone Numbers: The clean_phonenumber function standardizes a string to the Canadian phone number format (“+1 (XXX) XXX-XXXX”):
from canadataClean import clean_phonenumber
cleaned_phonenumber = clean_phonenumber("phone_number")  # Replace phone_number with the actual phone number
Where This Fits in the Python Ecosystem
canadataClean fits into the broader Python data processing and data quality ecosystem, alongside libraries such as pandas and data validation tools like pydantic. While pandas provides flexible, general-purpose tools for data manipulation, and pydantic offers highly configurable, schema-based validation, canadataClean takes a lightweight, targeted approach to data cleaning.
The package specializes in Canada-specific data standardization and validation, including postal codes, phone numbers, provinces, cities, and date formats. Unlike more general or schema-heavy validation libraries, canadataClean offers simple, string-based utility functions that can be easily integrated into existing pandas workflows. It is designed for users who need fast, consistent cleaning of Canadian datasets without configuring complex validation pipelines, making it well-suited for practical data preparation and preprocessing tasks.
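As an illustration of what a simple, string-based utility of this kind might look like, the sketch below normalizes a province name to its two-letter code and a date string to YYYY-MM-DD using only the standard library. Everything here is hypothetical: the function names, the partial lookup table, the list of accepted input formats, and the `None` return for unrecognized input are assumptions for the example, not the package's actual behaviour.

```python
from datetime import datetime

# Partial lookup table for illustration; a full table would cover all
# ten provinces and three territories.
_PROVINCE_CODES = {
    "british columbia": "BC",
    "ontario": "ON",
    "quebec": "QC",
    "nova scotia": "NS",
    "yukon": "YT",
}

# Input formats tried in order (an assumed, deliberately short list).
_DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d")


def sketch_clean_location(text):
    """Return the two-letter code for a known name or code, else None."""
    key = " ".join(text.split()).lower()  # collapse whitespace, ignore case
    if key.upper() in _PROVINCE_CODES.values():
        return key.upper()  # input was already a known code
    return _PROVINCE_CODES.get(key)


def sketch_clean_date(text):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


print(sketch_clean_location("  British   Columbia "))  # BC
print(sketch_clean_date("2024/3/5"))                   # 2024-03-05
```

Functions shaped like this take one string and return one string, which is what makes them easy to apply element-wise to a column of values.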
Dependencies
Contributors
- Molly Kessler
- Raymond Wang
- Sasha S
- Randall Lee
Contributing
Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
License
canadataClean was created by Molly Kessler, Raymond Wang, Sasha S, and Randall Lee. It is licensed under the terms of the MIT License.
Credits
canadataClean was created with pyopensci.