check_linearity
check_linearity.py: Module for analyzing linear relationships between numeric features and a target variable.
This module provides tools for identifying features in a pandas DataFrame that have a strong linear relationship with a specified numeric target column using Pearson correlation.
Functions
- check_linearity(df, target, threshold=0.7) Identifies numeric features whose absolute Pearson correlation with the target column is greater than or equal to a given threshold.
Functions
| Name | Description |
|---|---|
| check_linearity | Identify features with a specified strength of linear relationship to the target. |
check_linearity
check_linearity.check_linearity(df, target, threshold=0.7)
Identify features with a specified strength of linear relationship to the target.
This function selects the numeric features in a DataFrame and computes the Pearson correlation coefficient between each of them and the specified numeric target column. It returns a DataFrame containing the features whose absolute correlation with the target is greater than or equal to the given threshold, along with their correlation values.
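The behavior described above can be approximated in a few lines of pandas. The following is a minimal sketch, not the module's actual source: the name `check_linearity_sketch` is hypothetical, and it assumes the target column exists and is numeric.

```python
import pandas as pd


def check_linearity_sketch(df: pd.DataFrame, target: str, threshold: float = 0.7) -> pd.DataFrame:
    """Hypothetical re-implementation for illustration; not the module's source."""
    # Keep only numeric columns, then separate the features from the target.
    numeric = df.select_dtypes(include="number")
    features = numeric.drop(columns=[target])

    # Pearson correlation of every numeric feature with the target column.
    corr = features.corrwith(numeric[target], method="pearson")

    # Keep features at or above the threshold in absolute value,
    # ordered from the strongest to the weakest relationship.
    strong = corr[corr.abs() >= threshold]
    strong = strong.sort_values(key=lambda s: s.abs(), ascending=False)

    return pd.DataFrame({"feature": strong.index, "correlation": strong.to_numpy()})
```

Filtering on `corr.abs()` while returning the signed value keeps strong negative relationships in the result without hiding their direction.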
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| df | pandas.DataFrame | Input DataFrame with feature columns and the target column. Only numeric features will be considered. | required |
| target | str | Name of the target column. The column must be numeric and the name must match the column name in the data. | required |
| threshold | float | Minimum absolute Pearson correlation required for a feature to be considered strongly correlated with the target. Must be between 0 and 1. Default is 0.7. | 0.7 |
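The constraints in the table above can be checked before any correlation is computed. The helper below is a sketch under stated assumptions: `validate_inputs` is a hypothetical name, and the exception types are illustrative choices, since this page does not say how the real function reacts to invalid input.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def validate_inputs(df: pd.DataFrame, target: str, threshold: float) -> None:
    """Hypothetical pre-checks mirroring the documented parameter constraints."""
    # The target name must match a column in the data.
    if target not in df.columns:
        raise KeyError(f"target column {target!r} not found in df")
    # The target column must be numeric.
    if not is_numeric_dtype(df[target]):
        raise TypeError(f"target column {target!r} must be numeric")
    # The threshold is compared against an absolute correlation, so it lies in [0, 1].
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
```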
Returns
| Type | Description |
|---|---|
| pandas.DataFrame | A DataFrame with two columns: `feature` (str), the name of the feature column, and `correlation` (float), the Pearson correlation coefficient between the feature and the target. The DataFrame is sorted by absolute correlation in descending order. |
Examples
>>> import pandas as pd
>>> df_example = pd.DataFrame({
... "sqft": [500, 700, 900, 1100],
... "num_rooms": [1, 2, 1, 3],
... "age": [40, 25, 20, 5],
... "distance_to_city": [10, 12, 11, 13],
... "price": [150, 210, 260, 320]
... })
>>> check_linearity(df=df_example, target="price", threshold=0.7)
            feature  correlation
0              sqft        0.999
1               age       -0.990
2  distance_to_city        0.821
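A single coefficient from the example can be cross-checked independently of this module. The sketch below computes the Pearson correlation between `sqft` and `price` directly from its definition and compares it with NumPy's `np.corrcoef`; the variable names are illustrative only.

```python
import numpy as np

x = np.array([500, 700, 900, 1100], dtype=float)  # sqft
y = np.array([150, 210, 260, 320], dtype=float)   # price

# Pearson r from its definition: the sum of products of deviations
# divided by the square root of the product of summed squared deviations.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# The same value from NumPy's 2x2 correlation matrix (off-diagonal entry).
r_numpy = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 3), round(r_numpy, 3))
```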