check_linearity

check_linearity.py

Module for analyzing linear relationships between numeric features and a target variable.

This module provides tools for identifying features in a pandas DataFrame that have a strong linear relationship with a specified numeric target column using Pearson correlation.

Functions

  • check_linearity(df, target, threshold=0.7): Identifies numeric features whose absolute Pearson correlation with the target column is greater than or equal to a given threshold.

Functions

Name Description
check_linearity Identify features whose linear relationship to the target meets a minimum strength.

check_linearity

check_linearity.check_linearity(df, target, threshold=0.7)

Identify features whose linear relationship to the target meets a minimum strength.

This function selects the numeric features in a DataFrame and computes the Pearson correlation coefficient between each of them and the specified numeric target column. It returns a DataFrame listing the features whose absolute correlation with the target is greater than or equal to the given threshold, along with their correlation values.
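The steps described above can be sketched with standard pandas calls. This is a minimal illustration, not the module's actual source: the name check_linearity_sketch is hypothetical, and the use of select_dtypes and corrwith is an assumption about one reasonable way to implement the documented behavior.

```python
import pandas as pd


def check_linearity_sketch(df: pd.DataFrame, target: str, threshold: float = 0.7) -> pd.DataFrame:
    """Hypothetical re-implementation for illustration; not the module's source."""
    # Keep only numeric columns; non-numeric features are ignored.
    # A non-numeric target is dropped here too, so numeric[target] raises KeyError.
    numeric = df.select_dtypes(include="number")
    features = numeric.drop(columns=[target])

    # Pearson correlation of every numeric feature with the target.
    corr = features.corrwith(numeric[target], method="pearson")

    # Keep features whose absolute correlation meets the threshold.
    strong = corr[corr.abs() >= threshold]

    # Order by absolute correlation, strongest first, keeping the sign.
    order = strong.abs().sort_values(ascending=False).index
    return pd.DataFrame(
        {"feature": list(order), "correlation": strong.loc[order].to_numpy()}
    )
```

Filtering on the absolute value while returning the signed coefficient lets strongly negative relationships pass the threshold without losing their direction.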

Parameters

  • df (pandas.DataFrame, required): Input DataFrame with feature columns and the target column. Only numeric features will be considered.
  • target (str, required): Name of the target column. The column must be numeric and the name must match the column name in the data.
  • threshold (float, default 0.7): Minimum absolute Pearson correlation required for a feature to be considered strongly correlated with the target. Must be between 0 and 1.
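The constraints on these parameters can be checked before any correlation is computed. The source does not say how check_linearity reacts to invalid input, so the sketch below is an assumption: validate_inputs is a hypothetical helper, the exception types are illustrative choices, and the bounds on threshold are treated as inclusive.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def validate_inputs(df, target, threshold):
    """Hypothetical pre-checks mirroring the documented parameter constraints."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame")
    # The target name must match a column name exactly.
    if target not in df.columns:
        raise KeyError(f"target column {target!r} not found in df")
    # The target column must hold numeric data.
    if not is_numeric_dtype(df[target]):
        raise TypeError(f"target column {target!r} must be numeric")
    # Assumption: both endpoints, 0 and 1, are accepted.
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
```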

Returns

pandas.DataFrame: A DataFrame sorted by absolute correlation in descending order, with the following columns:

  • feature (str): Name of the feature column.
  • correlation (float): Pearson correlation coefficient between the feature and the target.

Examples

>>> import pandas as pd
>>> from check_linearity import check_linearity
>>> df_example = pd.DataFrame({
...     "sqft": [500, 700, 900, 1100],
...     "num_rooms": [1, 2, 1, 3],
...     "age": [40, 25, 20, 5],
...     "distance_to_city": [10, 12, 11, 13],
...     "price": [150, 210, 260, 320]
... })
>>> check_linearity(df=df_example, target="price", threshold=0.7)
            feature  correlation
0              sqft        0.999
1               age       -0.990
2  distance_to_city        0.821
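To see what the correlation column measures, the Pearson coefficient can be computed directly from its definition and compared with pandas' built-in Series.corr. The helper pearson_r below is a hypothetical name introduced for illustration; it is not part of the check_linearity module.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation: sum of co-deviations over the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


df_example = pd.DataFrame({
    "sqft": [500, 700, 900, 1100],
    "num_rooms": [1, 2, 1, 3],
    "age": [40, 25, 20, 5],
    "distance_to_city": [10, 12, 11, 13],
    "price": [150, 210, 260, 320],
})

# Compare the hand-computed value with pandas' built-in Pearson correlation.
for col in df_example.columns.drop("price"):
    manual = pearson_r(df_example[col], df_example["price"])
    builtin = df_example[col].corr(df_example["price"])
    print(f"{col}: manual={manual:.3f}, pandas={builtin:.3f}")
```

Running this for each feature makes it easy to confirm which columns meet a given threshold before relying on the filtered result.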