check_multicollinearity_vif
check_multicollinearity_vif(X, *, target_column=None, warn_threshold=5.0, severe_threshold=10.0, categorical='error', drop_constant=True)
Compute multicollinearity diagnostics using Variance Inflation Factor (VIF).
Multicollinearity refers to strong linear dependence among predictor variables. It does NOT involve the target variable. VIF is defined for each predictor x_j as:
VIF_j = 1 / (1 - R_j^2)
where R_j^2 is the coefficient of determination from regressing x_j on all other predictors. High VIF indicates inflated variance of coefficient estimates in ordinary least squares (OLS), leading to unstable coefficients.
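The definition above can be sketched directly with NumPy. This is a minimal illustration, not this function's actual implementation: `vif_sketch` is a hypothetical name, the input is assumed to be a fully numeric 2-D array with no constant columns and no missing values, and the `1e-12` tolerance for treating R_j^2 as exactly 1 is an arbitrary choice.

```python
import numpy as np


def vif_sketch(X):
    """Return one VIF per column of a 2-D numeric array (illustrative only)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression of x_j on all other predictors, with an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((y - y.mean()) ** 2).sum())  # zero for a constant column
        r_squared = 1.0 - ss_res / ss_tot
        # Assumption: R_j^2 within 1e-12 of 1 is reported as perfect collinearity.
        if r_squared >= 1.0 - 1e-12:
            vifs.append(float("inf"))
        else:
            vifs.append(1.0 / (1.0 - r_squared))
    return vifs


# Two uncorrelated predictors, then the same pair plus their sum.
independent = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
collinear = [[1, 1, 2], [-1, 1, 0], [1, -1, 0], [-1, -1, -2]]
print(vif_sketch(independent))
print(vif_sketch(collinear))
```

In the first data set each predictor is uncorrelated with the other, so R_j^2 is 0 and each VIF is 1. In the second, every column is an exact linear combination of the other two, so each VIF is infinite.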
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| X | pd.DataFrame | DataFrame of predictors (features). Each column is treated as a predictor. If the target column is included, specify it via target_column. | required |
| target_column | str | Name of the target column to exclude from the VIF calculation. If None, X is assumed to contain only predictors. Raises ValueError if specified but not found in X. | None |
| warn_threshold | float | VIF threshold for flagging features as "warn". Common heuristic: VIF > 5 suggests moderate multicollinearity. | 5.0 |
| severe_threshold | float | VIF threshold for flagging features as "severe". Common heuristic: VIF > 10 suggests severe multicollinearity. Must be >= warn_threshold. | 10.0 |
| categorical | CategoricalHandling | How to handle non-numeric columns. "error": raise ValueError if non-numeric columns are present. "drop": remove non-numeric columns and report them in the summary. | "error" |
| drop_constant | bool | Whether to drop constant columns (where all values are identical). If True, constant columns are removed and reported in the summary. If False, VIF computation may raise ValueError due to singularity. | True |
Returns
| Type | Description |
|---|---|
| pd.DataFrame | One row per feature, sorted by VIF in descending order, with columns: "feature" (str), the feature name; "vif" (float), the VIF value (may be inf for perfect collinearity); "level" (str), one of {"ok", "warn", "severe"}. |
| dict | Overall diagnostics containing: "overall_status" (str), the worst level found ("ok", "warn", or "severe"); "n_features" (int), the number of features evaluated; "n_warn" (int), the count of features with warn-level VIF; "n_severe" (int), the count of features with severe-level VIF; "warn_threshold" (float), echo of the input threshold; "severe_threshold" (float), echo of the input threshold; "dropped_non_numeric" (list[str]), non-numeric columns dropped (if categorical="drop"); "dropped_constant" (list[str]), constant columns dropped (if drop_constant=True). |
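How VIF values map to the "level" column and to "overall_status" can be sketched as follows. `classify_vif` and `summarize_sketch` are hypothetical names, and the strict `>` comparisons are an assumption taken from the "VIF > 5" and "VIF > 10" heuristics; the documentation does not say how values exactly equal to a threshold are classified. The "dropped_non_numeric" and "dropped_constant" keys are omitted here for brevity.

```python
import math


def classify_vif(vif, warn_threshold=5.0, severe_threshold=10.0):
    # Assumption: strict ">" at both thresholds.
    if vif > severe_threshold:
        return "severe"
    if vif > warn_threshold:
        return "warn"
    return "ok"


def summarize_sketch(vifs, warn_threshold=5.0, severe_threshold=10.0):
    """Build sorted per-feature rows and a summary dict from {feature: vif}."""
    rows = [
        {
            "feature": name,
            "vif": value,
            "level": classify_vif(value, warn_threshold, severe_threshold),
        }
        for name, value in vifs.items()
    ]
    # Rows are sorted by VIF in descending order, so inf comes first.
    rows.sort(key=lambda row: row["vif"], reverse=True)
    n_warn = sum(row["level"] == "warn" for row in rows)
    n_severe = sum(row["level"] == "severe" for row in rows)
    # The overall status is the worst level found among the features.
    overall = "severe" if n_severe else ("warn" if n_warn else "ok")
    summary = {
        "overall_status": overall,
        "n_features": len(rows),
        "n_warn": n_warn,
        "n_severe": n_severe,
        "warn_threshold": warn_threshold,
        "severe_threshold": severe_threshold,
    }
    return rows, summary


# Made-up VIF values, for illustration only.
rows, summary = summarize_sketch({"age": 1.2, "income": 7.5, "income_copy": math.inf})
```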
Raises
| Type | Description |
|---|---|
| ValueError | Raised if any of the following holds: target_column is specified but not found in X; categorical="error" and non-numeric columns exist; drop_constant=False and constant columns prevent VIF computation; warn_threshold <= 0 or severe_threshold < warn_threshold; fewer than 2 features remain after dropping columns; X contains missing values (NaN/None) in evaluated predictors. |
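The documented checks could be arranged roughly as in the sketch below. `prepare_predictors_sketch` is a hypothetical helper, and the order of the checks and the error messages are assumptions; only the conditions themselves come from the table above.

```python
import pandas as pd


def prepare_predictors_sketch(X, target_column=None, warn_threshold=5.0,
                              severe_threshold=10.0, categorical="error",
                              drop_constant=True):
    """Validate inputs and return (predictors, dropped_non_numeric, dropped_constant)."""
    if warn_threshold <= 0 or severe_threshold < warn_threshold:
        raise ValueError("thresholds must satisfy 0 < warn_threshold <= severe_threshold")

    if target_column is not None:
        if target_column not in X.columns:
            raise ValueError(f"target_column {target_column!r} not found in X")
        X = X.drop(columns=[target_column])

    # Non-numeric columns either abort the call or are dropped and reported.
    non_numeric = [c for c in X.columns if not pd.api.types.is_numeric_dtype(X[c])]
    if non_numeric:
        if categorical == "error":
            raise ValueError(f"non-numeric columns present: {non_numeric}")
        X = X.drop(columns=non_numeric)

    if X.isna().any().any():
        raise ValueError("missing values found in evaluated predictors")

    # A column is constant when it holds a single distinct value.
    dropped_constant = []
    if drop_constant:
        dropped_constant = [c for c in X.columns if X[c].nunique() == 1]
        X = X.drop(columns=dropped_constant)

    if X.shape[1] < 2:
        raise ValueError("fewer than 2 features remain after dropping columns")
    return X, non_numeric, dropped_constant


# Illustrative frame with a target, a constant column, and a string column.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 1.0, 5.0],
    "k": [1, 1, 1],
    "s": ["x", "y", "z"],
    "y": [0, 1, 0],
})
predictors, dropped_non_numeric, dropped_constant = prepare_predictors_sketch(
    df, target_column="y", categorical="drop"
)
```

With `categorical="drop"`, the string column `s` and the constant column `k` are removed and reported, leaving `a` and `b` as the evaluated predictors.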
Notes
- VIF measures linear dependence among predictors only, not their relationship with the target variable.
- VIF = inf indicates perfect multicollinearity (one predictor is a perfect linear combination of others).
- The auxiliary regressions used to compute R_j^2 include an intercept term.
- Constant columns have no variance and will cause numerical issues if not dropped.
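The last note can be checked numerically. In the NumPy sketch below (variable names are illustrative), a constant column has zero total sum of squares, so its own R_j^2 is undefined, and when it appears alongside the intercept in another predictor's auxiliary regression, the design matrix loses rank.

```python
import numpy as np

constant = np.full(4, 7.0)

# Total sum of squares of the constant column: the denominator of R^2.
total_ss = float(((constant - constant.mean()) ** 2).sum())

# Design matrix for regressing some other predictor on the constant
# column plus an intercept: both columns are multiples of a vector of ones.
design = np.column_stack([np.ones(4), constant])
rank = int(np.linalg.matrix_rank(design))

print(total_ss)                # 0.0
print(rank, design.shape[1])   # 1 2
```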