Welcome to lrassume

lrassume (Linear Regression Assumption Validator) is a Python package for validating the core assumptions of linear regression models. It provides statistical tests and diagnostic tools to assess independence, linearity, multicollinearity, and homoscedasticity in your regression workflows.

Features

Independence Testing: Durbin-Watson test to detect autocorrelation in residuals
Linearity Assessment: Pearson correlation analysis to identify linear relationships with the target
Multicollinearity Detection: Variance Inflation Factor (VIF) calculation with configurable thresholds
Homoscedasticity Testing: Multiple statistical tests (Breusch-Pagan, White, Goldfeld-Quandt) to detect heteroscedasticity

Dependencies

All runtime dependencies are automatically installed when you install lrassume.

Runtime Requirements:

Python ≥ 3.10
statsmodels: Statistical tests and regression diagnostics
numpy: Numerical computing
pandas: Data manipulation
scipy: Scientific computing

Development Dependencies (for contributors):

Testing: pytest, pytest-cov, pytest-raises, pytest-xdist
Code Quality: black, ruff, pre-commit
Documentation: quartodoc, quarto, jupyter
Build Tools: hatch, pip-audit, twine

Ecosystem Context

lrassume complements popular regression and machine learning libraries in the Python ecosystem:

scikit-learn: lrassume provides specialized diagnostic tools for linear regression assumptions, while scikit-learn focuses on model building and prediction. Use lrassume to validate your scikit-learn linear regression models before deployment.
statsmodels: statsmodels offers comprehensive statistical modeling and includes assumption tests, but requires more setup. lrassume provides a streamlined, user-friendly interface specifically designed for assumption checking without additional configuration.
pandas: lrassume works seamlessly with pandas DataFrames, making it easy to integrate into your existing data analysis workflows.

When to use lrassume: Choose lrassume when you need quick, focused assumption checks for linear regression, either before fitting a model or before relying on one. It’s ideal for educational purposes, exploratory data analysis, and model diagnostics in machine learning pipelines.

Installation

User Setup

This option is recommended if you want to use lrassume in your own projects and do not need to modify the source code.

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ lrassume

Development Setup (Recommended)

This option is recommended if you want to develop, modify, or contribute to lrassume.

This project uses Conda to manage the Python environment and pip to install project dependencies.

1. Clone the Repository and Navigate to the Project Directory

git clone https://github.com/UBC-MDS/lrassume.git
cd lrassume

2. Create and Activate the Conda Environment

From the project root directory:

conda env create -f environment.yml
conda activate lrassume

The environment.yml file installs Python only. All runtime dependencies are specified in pyproject.toml.

3. Install the Package in Editable Mode

pip install -e .

Alternative: Development Without Conda

If you prefer not to use Conda, you can install the package directly using pip:

git clone https://github.com/UBC-MDS/lrassume.git
cd lrassume
pip install -e .

Running the Test Suite Locally (Developers)

The test suite is written for pytest. CI runs it through Hatch, but you can also run it locally with pytest directly.

Install pytest

conda install pytest
# or
pip install pytest

Run the tests

pytest

NOTE: If the pytest command fails to run right after installation, restart your terminal and rerun pytest. The current shell session may not have re-sourced its .bash_* startup files, so the newly installed package is not yet linked to the terminal.

Continuous Integration (Automated Testing)

This project uses GitHub Actions to automatically run the test suite.

The tests are executed automatically on:

Pull requests
Pushes to the main branch
A scheduled weekly run

The test suite is executed using Hatch, which runs the project’s configured pytest test environment across multiple operating systems and Python versions.

No manual action is required to trigger these tests.

The GitHub Actions workflow responsible for running the test suite is located at:

.github/workflows/test.yml

Documentation

The full package documentation is built with Quartodoc and deployed automatically to GitHub Pages.

Live documentation: https://ubc-mds.github.io/lrassume/

Build Documentation Locally (Developers)

To preview documentation changes before pushing:

Ensure you are in the development environment:

   conda activate lrassume

Install documentation dependencies:

   pip install -e ".[docs]"

Build the documentation:

   quartodoc build

Preview the documentation locally:

   quarto preview

This will open the documentation site in your browser.

Update Documentation

To update documentation:

Edit docstrings in Python source files (lrassume/*.py)
Rebuild locally using the steps above to verify changes
Commit and push to your branch

Note: The documentation is automatically generated from your Python docstrings.

Deploy Documentation (Automated)

Documentation deployment is fully automated using GitHub Actions.

On every push to the main branch:

GitHub Actions builds the documentation using Quarto and Quartodoc
The rendered site is deployed to GitHub Pages

No manual deployment steps are required.

The workflow file can be found at:

.github/workflows/docs-publish.yml

Quick Start

Check Independence

This function fits a linear model and checks for autocorrelation in the residuals.

import pandas as pd
from lrassume import check_independence

# Create sample data
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 5, 7, 8],
    "y": [10, 20, 25, 35, 40]
})

# Check independence of residuals
result = check_independence(df, target="y")

# View results
print(result['dw_statistic']) 
print(result['is_independent'])  
print(result['message'])  
# 0.0727
# False
# Positive autocorrelation detected. Residuals may not be independent.

Interpreting the Durbin-Watson statistic:

1.5 to 2.5: No significant autocorrelation (residuals are independent) ✓
< 1.5: Positive autocorrelation detected
> 2.5: Negative autocorrelation detected

Note: The function automatically uses all numeric columns (except the target) as predictors and handles the intercept term internally.
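For readers who want to see what the statistic measures, here is a minimal sketch of the textbook Durbin-Watson formula (sum of squared successive residual differences divided by the sum of squared residuals), paired with the 1.5 and 2.5 cut-offs listed above. The helper names `durbin_watson` and `interpret_dw` and the residual series are hypothetical and are not part of the lrassume API; the package's internals may differ.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: squared successive differences over squared residuals."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den


def interpret_dw(dw, lower=1.5, upper=2.5):
    """Map a DW value to the bands listed above (cut-offs taken from this README)."""
    if dw < lower:
        return "positive autocorrelation"
    if dw > upper:
        return "negative autocorrelation"
    return "no significant autocorrelation"


# Made-up residual series, for illustration only
smooth = [1.0, 0.8, 0.5, 0.1, -0.3, -0.6, -0.9]   # neighbouring residuals are similar
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # neighbouring residuals flip sign

for name, res in [("smooth", smooth), ("alternating", alternating)]:
    dw = durbin_watson(res)
    print(name, round(dw, 3), interpret_dw(dw))
```

Slowly drifting residuals push the statistic toward 0, while residuals that alternate in sign push it toward 4, which is why values near 2 indicate independence.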

Check Linearity

Identify features with strong linear relationships to the target:

import pandas as pd
from lrassume import check_linearity

df = pd.DataFrame({
    "sqft": [500, 700, 900, 1100],
    "num_rooms": [1, 2, 1, 3],
    "age": [40, 25, 20, 5],
    "price": [150, 210, 260, 320]
})

linear_features = check_linearity(df, target="price", threshold=0.7)
print(linear_features)
#  feature  correlation
# 0    sqft        0.999
# 1     age       -0.990
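The filtering idea behind this check can be sketched in plain Python: compute the Pearson correlation of each feature with the target and keep those whose absolute correlation meets the threshold. The helpers `pearson_r` and `linear_features` are hypothetical names, and comparing the absolute value with `>=` is an assumption (it is consistent with the negatively correlated `age` column being reported above); lrassume's own implementation may differ.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def linear_features(columns, target, threshold=0.7):
    """Return (feature, r) pairs whose |r| with the target meets the threshold."""
    y = columns[target]
    pairs = [(name, pearson_r(xs, y)) for name, xs in columns.items() if name != target]
    return [(name, r) for name, r in pairs if abs(r) >= threshold]


# Same toy data as the example above
data = {
    "sqft": [500, 700, 900, 1100],
    "num_rooms": [1, 2, 1, 3],
    "age": [40, 25, 20, 5],
    "price": [150, 210, 260, 320],
}
for name, r in linear_features(data, "price"):
    print(f"{name}: {r:.3f}")
```

On this data, `num_rooms` has a correlation just under 0.7 with `price`, which is why only `sqft` and `age` pass the threshold.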

Check Multicollinearity

Compute Variance Inflation Factors to detect multicollinearity:

import pandas as pd
from lrassume import check_multicollinearity_vif

X = pd.DataFrame({
    "sqft": [800, 900, 1000, 1100, 1200, 1300, 1400, 1500],
    "bedrooms": [1, 2, 1, 3, 2, 4, 3, 5],
    "age": [30, 5, 40, 10, 25, 15, 35, 20]
})

vif_table, summary = check_multicollinearity_vif(X, warn_threshold=5.0)
print(summary['overall_status'])  # 'ok', 'warn', or 'severe'
# severe
print(vif_table)
#    feature        vif   level
# 0  bedrooms  11.100000  severe
# 1     sqft   9.402273    warn
# 2      age   3.102273      ok
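As background, a VIF is conventionally defined as 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below implements that definition with NumPy, assuming an intercept is included in each auxiliary regression. The function name `vif` and the demo matrix are hypothetical, and whether lrassume computes its values in exactly this way is an assumption.

```python
import numpy as np


def vif(X):
    """VIF per column: regress each column on the others (plus intercept), then 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        # Perfectly collinear columns give r2 == 1 and are not handled in this sketch
        out.append(1.0 / (1.0 - r2))
    return out


# Made-up data: the first two columns are nearly identical, the third is unrelated
X_demo = np.column_stack([
    [1, 2, 3, 4, 5, 6],
    [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
    [3, 1, 4, 1, 5, 9],
])
print([round(v, 2) for v in vif(X_demo)])
```

A predictor that the others cannot explain has R² near 0 and a VIF near 1; the closer a predictor is to a linear combination of the others, the larger its VIF.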

Check Homoscedasticity

Test for constant variance in residuals:

import pandas as pd
import numpy as np
from lrassume import check_homoscedasticity
np.random.seed(123)
X = pd.DataFrame({
    'x1': np.linspace(1, 100, 100),
    'x2': np.random.randn(100)
})
y = pd.Series(2 * X['x1'] + 3 * X['x2'] + np.random.randn(100))

test_results, summary = check_homoscedasticity(X, y, method="breusch_pagan")
print(summary['overall_conclusion'])  # 'homoscedastic' 
print(test_results)
#            test  statistic  p_value     conclusion  significant
# 0  breusch_pagan      1.111   0.5737  homoscedastic        False
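To illustrate the idea behind the Breusch-Pagan test, the sketch below computes the studentized (Koenker) form of the statistic: fit OLS, regress the squared residuals on the same predictors, and take n · R² of that auxiliary regression. The helper names and the synthetic data are hypothetical, and it is an assumption that lrassume uses this variant. The statistic is compared against a chi-squared distribution with one degree of freedom per predictor; with exactly two predictors the tail probability has the closed form exp(-LM / 2), which keeps the example free of extra dependencies.

```python
import math

import numpy as np


def ols_residuals(X, y):
    """Residuals from an OLS fit of y on X with an intercept column added."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef


def breusch_pagan_lm(X, y):
    """Studentized Breusch-Pagan statistic: n * R^2 of squared residuals regressed on X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    e2 = ols_residuals(X, y) ** 2
    aux_resid = ols_residuals(X, e2)
    r2 = 1.0 - (aux_resid @ aux_resid) / ((e2 - e2.mean()) ** 2).sum()
    return len(y) * r2, X.shape[1]  # statistic and degrees of freedom


# Deterministic, made-up data whose noise amplitude grows with x1
i = np.arange(100)
x1 = i + 1.0
x2 = np.sin(i)
noise = x1 * np.where(i % 2 == 0, 1.0, -1.0)
y_demo = 2 * x1 + 3 * x2 + noise

lm, df = breusch_pagan_lm(np.column_stack([x1, x2]), y_demo)
p_value = math.exp(-lm / 2)  # chi-squared tail, valid only because df == 2
print(f"LM = {lm:.2f}, df = {df}, p = {p_value:.3g}")
print("heteroscedastic" if p_value <= 0.05 else "homoscedastic")
```

The same closed form is consistent with the output shown above: a statistic of 1.111 with two predictors gives exp(-1.111 / 2) ≈ 0.574. For other numbers of predictors, `scipy.stats.chi2.sf(lm, df)` gives the tail probability.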

Core Assumptions Tested

1. Independence

Residuals should be independent of each other (no autocorrelation). Violations are common in time-series or spatially correlated data.

2. Linearity

The relationship between predictors and the target should be approximately linear. Non-linear relationships may require transformations or non-linear models.

3. Multicollinearity

Predictors should not be highly correlated with each other. High multicollinearity inflates standard errors and makes coefficient estimates unstable.

4. Homoscedasticity

Residuals should have constant variance across all levels of predictors. Heteroscedasticity leads to inefficient estimates and incorrect standard errors.

Advanced Usage

Working with Pre-fitted Models

from sklearn.linear_model import LinearRegression
from lrassume import check_homoscedasticity

# X and y are the predictors and target from the Check Homoscedasticity example
model = LinearRegression().fit(X, y)
test_results, summary = check_homoscedasticity(
    X, y,
    fitted_model=model,
    method="all"  # Run all tests
)

Handling Categorical Variables

from lrassume import check_multicollinearity_vif

# df is a DataFrame that contains the 'price' target and may contain non-numeric columns
# Automatically drop non-numeric columns
vif_table, summary = check_multicollinearity_vif(
    df,
    target_column='price',
    categorical='drop'
)
print(summary['dropped_non_numeric'])  # Lists dropped columns

Custom Thresholds

from lrassume import check_multicollinearity_vif, check_homoscedasticity

# Stricter multicollinearity detection
vif_table, summary = check_multicollinearity_vif(
    X,
    warn_threshold=3.0,
    severe_threshold=5.0
)

# More conservative homoscedasticity testing
test_results, summary = check_homoscedasticity(
    X, y,
    alpha=0.01  # 1% significance level
)

Function Reference

Function	Purpose	Key Parameters
`check_independence()`	Durbin-Watson test for autocorrelation	`df`, `target`
`check_linearity()`	Pearson correlation analysis	`df`, `target`, `threshold`
`check_multicollinearity_vif()`	VIF calculation	`X`, `warn_threshold`, `severe_threshold`
`check_homoscedasticity()`	Heteroscedasticity testing	`X`, `y`, `method`, `alpha`

Interpretation Guidelines

VIF Thresholds

VIF < 5: No concerning multicollinearity
5 ≤ VIF < 10: Moderate multicollinearity (warning)
VIF ≥ 10: Severe multicollinearity (action recommended)
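The thresholds above amount to a simple classification rule, sketched here in plain Python. The parameter names `warn_threshold` and `severe_threshold` and the level labels come from the examples in this README; the helper names `vif_level` and `overall_status` are hypothetical, and taking the worst per-feature level as the overall status is an assumption.

```python
LEVELS = ["ok", "warn", "severe"]  # ordered from least to most concerning


def vif_level(vif, warn_threshold=5.0, severe_threshold=10.0):
    """Classify a single VIF value using the cut-offs listed above."""
    if vif >= severe_threshold:
        return "severe"
    if vif >= warn_threshold:
        return "warn"
    return "ok"


def overall_status(vifs, warn_threshold=5.0, severe_threshold=10.0):
    """Assumed rule: the overall status is the worst level among all features."""
    levels = [vif_level(v, warn_threshold, severe_threshold) for v in vifs]
    return max(levels, key=LEVELS.index)


# VIF values copied from the Check Multicollinearity example output
for feature, value in [("bedrooms", 11.1), ("sqft", 9.402273), ("age", 3.102273)]:
    print(feature, vif_level(value))
print(overall_status([11.1, 9.402273, 3.102273]))
```

Lowering both thresholds, as in the Custom Thresholds example, moves borderline features into the warning or severe band without changing the VIF values themselves.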

Durbin-Watson Statistic

1.5 ≤ DW ≤ 2.5 (ideally DW ≈ 2): No significant autocorrelation (independence satisfied)
DW < 1.5: Positive autocorrelation
DW > 2.5: Negative autocorrelation

Homoscedasticity Tests

p-value > α: Fail to reject null hypothesis (homoscedastic)
p-value ≤ α: Reject null hypothesis (heteroscedastic)

Future Enhancements

While our current implementation covers the four foundational assumptions of linear regression (linearity, independence, homoscedasticity, and multicollinearity), there are several areas we’d like to expand on:

Residual Normality Testing: Add statistical tests and visualizations to check if residuals follow a normal distribution
Outlier Detection: Implement scatter plot visualizations and statistical methods to identify potential outliers in the dataset
Influence Diagnostics: Include Cook’s distance calculations to detect influential data points that may be disproportionately affecting the regression model
Enhanced Visualizations: Add more interactive plotting options for better data exploration
Additional Assumption Tests: Expand to cover other regression diagnostics beyond the core four assumptions

These additions would make the package more comprehensive for users conducting thorough regression diagnostics in their workflows.

Contributing

Contributions are welcome! Please see our Code of Conduct for community guidelines.

License

Free software distributed under the MIT License.

Support

For bug reports and feature requests, please open an issue on GitHub.