check_homoscedasticity

check_homoscedasticity

Homoscedasticity diagnostics for linear regression.

This module contains utilities to detect heteroscedasticity (non-constant variance) in residuals for linear regression workflows. Heteroscedasticity violates a key assumption of ordinary least squares (OLS) regression and can lead to inefficient estimates and incorrect standard errors.

The module provides the check_homoscedasticity function which implements three widely-used statistical tests: Breusch-Pagan, White, and Goldfeld-Quandt.

Functions

check_homoscedasticity : Test residuals for constant variance

Examples

Basic usage: >>> import pandas as pd >>> import numpy as np >>> from lrassume import check_homoscedasticity >>> >>> X = pd.DataFrame({‘x1’: range(100), ‘x2’: np.random.randn(100)}) >>> y = pd.Series(2 * X[‘x1’] + np.random.randn(100)) >>> results, summary = check_homoscedasticity(X, y) >>> print(summary[‘overall_conclusion’]) ‘homoscedastic’

Notes

All tests assume that residuals come from a linear regression model. If using non-linear models, interpret results with caution.

References

.. [1] Breusch, T. S., & Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient variation. Econometrica, 47(5), 1287-1294.

.. [2] White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817-838.

.. [3] Goldfeld, S. M., & Quandt, R. E. (1965). Some tests for homoscedasticity. Journal of the American Statistical Association, 60(310), 539-547.

Functions

Name	Description
check_homoscedasticity	Test for homoscedasticity (constant variance) in linear regression residuals.

check_homoscedasticity

check_homoscedasticity.check_homoscedasticity(X, y, *, method='breusch_pagan', alpha=0.05, fitted_model=None, residuals=None, fitted_values=None)

Test for homoscedasticity (constant variance) in linear regression residuals.

Homoscedasticity is the assumption that residuals have constant variance across all levels of the independent variables. Violation of this assumption (heteroscedasticity) leads to inefficient coefficient estimates and incorrect standard errors in ordinary least squares (OLS) regression.

This function implements multiple statistical tests to detect heteroscedasticity:

Breusch-Pagan test: Tests whether residual variance depends linearly on predictors. Null hypothesis: homoscedasticity (constant variance).
White test: More general test that allows for non-linear relationships between variance and predictors. Includes squared terms and interactions. Null hypothesis: homoscedasticity.
Goldfeld-Quandt test: Splits data by a predictor and compares variance in two subsets. Useful for detecting variance that increases/decreases with a specific predictor.

Parameters

Name	Type	Description	Default
`X`	pd.DataFrame	DataFrame of predictors (features). Each column is a predictor variable. Must contain only numeric columns.	required
`y`	pd.Series	Target variable (response). Must have the same length as X.	required
`method`	TestMethod	Statistical test(s) to perform: - “breusch_pagan”: Breusch-Pagan Lagrange multiplier test - “white”: White’s general heteroscedasticity test - “goldfeld_quandt”: Goldfeld-Quandt test (splits on first predictor by default) - “all”: Run all available tests	`"breusch_pagan"`
`alpha`	float	Significance level for hypothesis tests. Common values: 0.01, 0.05, 0.10. Must be between 0 and 1 (exclusive).	`0.05`
`fitted_model`	optional	Pre-fitted regression model object with `predict()` method. If None, an OLS model will be fitted internally using X and y. Useful for avoiding refitting when model already exists.	`None`
`residuals`	np.ndarray	Pre-computed residuals (y - y_pred). Must have same length as y. If None, residuals will be computed from fitted_model or internal fit. Cannot be specified without fitted_values.	`None`
`fitted_values`	np.ndarray	Pre-computed fitted values (y_pred). Must have same length as y. If None, fitted values will be computed from fitted_model or internal fit. Cannot be specified without residuals.	`None`

Returns

Type	Description
pd.DataFrame	One row per test performed, with columns: - “test” (str): Name of the test performed - “statistic” (float): Test statistic value, rounded to 3 decimals - “p_value” (float): P-value for the test, rounded to 4 decimals - “conclusion” (str): One of {“homoscedastic”, “heteroscedastic”} - “significant” (bool): True if p_value < alpha (reject null hypothesis) Rows are sorted by test name alphabetically.
dict	Overall diagnostics containing: - “overall_conclusion” (str): “homoscedastic” if all tests pass, otherwise “heteroscedastic” - “n_tests_performed” (int): Number of tests conducted - “n_tests_significant” (int): Number of tests rejecting homoscedasticity - “alpha” (float): Echo of significance level used - “n_observations” (int): Sample size - “n_predictors” (int): Number of predictor variables - “recommendation” (str): Suggested action if heteroscedasticity detected

pd.DataFrame

One row per test performed, with columns: - “test” (str): Name of the test performed - “statistic” (float): Test statistic value, rounded to 3 decimals - “p_value” (float): P-value for the test, rounded to 4 decimals - “conclusion” (str): One of {“homoscedastic”, “heteroscedastic”} - “significant” (bool): True if p_value < alpha (reject null hypothesis) Rows are sorted by test name alphabetically.

dict

Overall diagnostics containing: - “overall_conclusion” (str): “homoscedastic” if all tests pass, otherwise “heteroscedastic” - “n_tests_performed” (int): Number of tests conducted - “n_tests_significant” (int): Number of tests rejecting homoscedasticity - “alpha” (float): Echo of significance level used - “n_observations” (int): Sample size - “n_predictors” (int): Number of predictor variables - “recommendation” (str): Suggested action if heteroscedasticity detected

Raises

Type	Description
ValueError	- If X contains non-numeric columns - If X and y have different lengths - If alpha is not between 0 and 1 - If residuals is provided without fitted_values or vice versa - If residuals/fitted_values length doesn’t match y - If fewer than 10 observations are available (insufficient for testing)
TypeError	- If fitted_model is provided but lacks predict() method - If X is not a pandas DataFrame - If y is not a pandas Series

Notes

All tests assume residuals from a linear regression model.
Tests use chi-square or F-distributions depending on the method.
The Breusch-Pagan test is most powerful against linear heteroscedasticity.
The White test is more general but may have lower power with small samples.
Goldfeld-Quandt test requires ordering data, which may be arbitrary for multivariate predictors.
If heteroscedasticity is detected, consider using robust standard errors (e.g., HC3, HC4) or weighted least squares (WLS) regression.
Missing values in X or y will raise an error; clean data beforehand.

Examples

Basic usage with internal model fitting:

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(42)
>>> X = pd.DataFrame({
...     'x1': np.linspace(1, 100, 100),
...     'x2': np.random.randn(100)
... })
>>> y = pd.Series(2 * X['x1'] + 3 * X['x2'] + np.random.randn(100))
>>> test_results, summary = check_homoscedasticity(X, y)
>>> print(summary["overall_conclusion"])
'homoscedastic'

Using a pre-fitted model:

>>> from sklearn.linear_model import LinearRegression
>>> model = LinearRegression().fit(X, y)
>>> test_results, summary = check_homoscedasticity(
...     X, y, fitted_model=model
... )
>>> print(test_results)
          test  statistic   p_value      conclusion  significant
0  breusch_pagan      2.345      0.309  homoscedastic        False

Running all tests:

>>> test_results, summary = check_homoscedasticity(
...     X, y, method="all", alpha=0.01
... )
>>> print(summary["n_tests_performed"])
3
>>> print(summary["n_tests_significant"])
0

Detecting heteroscedasticity (variance increases with x):

>>> X_hetero = pd.DataFrame({
...     'x1': np.linspace(1, 100, 100)
... })
>>> errors = np.random.randn(100) * X_hetero['x1']  # variance increases
>>> y_hetero = pd.Series(2 * X_hetero['x1'] + errors)
>>> test_results, summary = check_homoscedasticity(X_hetero, y_hetero)
>>> print(summary["overall_conclusion"])
'heteroscedastic'
>>> print(summary["recommendation"])
'Consider using robust standard errors (HC3/HC4) or weighted least squares.'

Using pre-computed residuals and fitted values:

>>> model = LinearRegression().fit(X, y)
>>> y_pred = model.predict(X)
>>> resid = y - y_pred
>>> test_results, summary = check_homoscedasticity(
...     X, y,
...     residuals=resid,
...     fitted_values=y_pred
... )
>>> print(test_results)
          test  statistic   p_value      conclusion  significant
0  breusch_pagan      2.345      0.309  homoscedastic        False

References

.. [1] Breusch, T. S., & Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient variation. Econometrica, 47(5), 1287-1294.

.. [2] White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817-838.

.. [3] Goldfeld, S. M., & Quandt, R. E. (1965). Some tests for homoscedasticity. Journal of the American Statistical Association, 60(310), 539-547.