Predicting Diabetes in Pima Indian Women Using Logistic Regression¶
By Inder Khera, Javier Martinez, Jenny Zhang & Jessica Kuo (alphabetically ordered), 2024/11/23
import numpy as np
import pandas as pd
import altair as alt
import pandera as pa
import altair_ally as aly
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import (
RandomizedSearchCV,
cross_validate,
cross_val_score,
train_test_split,
)
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import (
ClassImbalance,
PercentOfNulls,
OutlierSampleDetection,
DataDuplicates,
MixedDataTypes,
FeatureLabelCorrelation,
FeatureFeatureCorrelation
)
import warnings
import json
import logging
Summary¶
This study evaluated logistic regression for predicting diabetes in Pima Indian women using features such as glucose levels, BMI, and pregnancies. The model achieved 75% accuracy on the test set, outperforming the baseline dummy classifier's 67.2%. Glucose was the strongest predictor, followed by BMI and pregnancies, while blood pressure and insulin had weaker impacts. The model misclassified 54 cases, with 35 false negatives and 19 false positives, highlighting areas for improvement.
The results indicate that logistic regression is a promising tool for diabetes screening, providing an efficient way to identify potential cases. However, the high number of false negatives is concerning, as they could lead to delayed diagnoses and treatments. Future improvements could include feature engineering to address misclassifications, testing alternative machine learning models, and incorporating additional data, such as lifestyle or genetic factors. Adding probability estimates for predictions could also enhance its clinical usability by helping prioritize further diagnostic tests. These steps could make the model more reliable and practical for real-world healthcare applications.
Introduction¶
Diabetes is a serious chronic disease characterized by high levels of glucose in the blood, caused by either insufficient insulin production by the pancreas or the body's inability to effectively use insulin. It has become a significant global health issue: in 2022, 14% of adults aged 18 and older were living with diabetes, double the 7% reported in 1990 (World Health Organization). Diabetes can lead to severe complications, including blindness, kidney failure, heart attacks, strokes, and lower limb amputations. Early detection enables timely interventions, reduces complications, lowers healthcare costs, and improves quality of life and long-term outcomes (Marshall & Flyvbjerg, 2006).
Artificial intelligence (AI) leverages computer systems and big data to simulate intelligent behavior with minimal human intervention, and machine learning (ML) is a subset of AI methodologies. Since the rise of AI, machine learning has increasingly been applied to disease detection and prevention in healthcare (Bini, 2018). Numerous machine learning techniques have been deployed to develop more efficient and effective methods for diagnosing chronic diseases (Battineni, Chinatalapudi, & Amenta, 2020). Utilizing machine learning methods in diabetes research has proven to be a critical strategy for harnessing large volumes of diabetes-related data and extracting valuable insights (Agarwal & Vadiwala, 2022). The goal of this report is therefore to leverage a supervised machine learning model, logistic regression (LR), and evaluate its predictive performance in diagnosing diabetes using a real-world dataset focused specifically on Pima Indian women aged 21 and older.
Methods and Results¶
Data¶
The dataset used for this analysis was created by Jack W. Smith, J. E. Everhart, W. C. Dickson, W. C. Knowler, and R. S. Johannes and sourced from the National Library of Medicine database at the National Institutes of Health. Access to their original analysis can be found here, and the dataset is available via Kaggle (Dua & Graff, 2017). The primary objective of the dataset is to enable diagnostic prediction of whether a patient has diabetes based on specific diagnostic measurements. To ensure consistency and relevance, several constraints were applied to the selection of data instances. Specifically, the dataset includes only female patients who are at least 21 years old and of Pima Indian heritage.
Each row/observation in the dataset represents an individual who identifies as part of the Pima (also known as the Akimel O'odham) Indigenous group, located mainly in central and southern Arizona in the United States. Each observation records features including Age, BMI, Blood Pressure, Number of Pregnancies, and the Diabetes Pedigree Function (a score reflecting the likelihood of diabetes based on a person's family history). The dataset offers comprehensive features for machine learning analysis.
Analysis¶
Logistic Regression was employed to develop a classification model for predicting whether a patient is diabetic or not (as indicated in the Outcome column of the dataset). All variables from the original dataset were used to train the model. The data was split into 70% for the training set and 30% for the testing set. Hyperparameter tuning was performed using RandomizedSearchCV, with the accuracy score serving as the classification metric. All variables were standardized immediately before model fitting. The analysis was conducted using the Python programming language (Van Rossum and Drake, 2009) and several Python packages: numpy (Harris et al., 2020), pandas (McKinney, 2010), altair (VanderPlas, 2018), altair_ally (Ostblom, 2021) and scikit-learn (Pedregosa et al., 2011). The code used for this analysis and report is available at: https://github.com/UBC-MDS/diabetes_predictor_py
# load data
df_original = pd.read_csv('../data/diabetes.csv')
df_original
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
768 rows × 9 columns
Validation Ranges¶
- For Outcome, each observation must have a value of either 0 or 1.
- For Pregnancies, we have chosen the range 0-15; although having 15 pregnancies is very unlikely, there have been rare historic cases at this upper limit. Anything beyond 15 is too extreme an outlier to keep within the dataset.
- For Glucose, we chose the range 50-240, as any glucose level below 50 or above 240 would require immediate medical attention; such values are outliers that are not sustainable and therefore not fit for our model.
- For BloodPressure, the acceptable range of 40-180 was chosen, as any blood pressure below 40 can cause dizziness, fainting, shock-like symptoms, cold/clammy skin, or rapid breathing, while anything above 180 is a dangerously high level and constitutes a hypertensive crisis. In both cases immediate medical attention is required, so these values are not suitable for our classification model.
- For SkinThickness (the triceps skin fold), the thickness cannot be below 0 mm, and any thickness above 80 mm is a very extreme and unlikely level of obesity. Any value below 0 mm would be an error, and any value above 80 mm is a very high outlier.
- For Insulin, the acceptable range is 0-800. Although an insulin level of 0 is unlikely, there are instances where the level is so low that it is not detected during a test, giving a value of 0. Anything above the upper limit of 800 is considered severe hyperglycemia and needs immediate emergency treatment, so it is not suitable when looking at ranges that exclude outliers.
- For BMI, the range 0-65 was chosen, as BMI cannot be lower than 0 and anything over 65 is an extreme case of obesity.
- For DiabetesPedigreeFunction, which represents the genetic risk of diabetes based on family history, the value cannot be smaller than 0 (which represents having no familial diabetic history), while the upper limit of 2.5 represents having both parents and other close family members with diabetic history, which is an unlikely value to encounter.
- For Age, we have chosen the range 18-90, as this test was conducted on adults; the upper limit of 90 was chosen because it is unlikely for a participant to be older than 90.
The shape attribute shows us the number of observations and the number of features in the dataset.
# Configure logging
logging.basicConfig(
filename="validation_errors.log",
filemode="w",
format="%(asctime)s - %(message)s",
level=logging.INFO,
)
# Define schema
schema = pa.DataFrameSchema(
{
"Outcome": pa.Column(int, pa.Check.isin([0, 1])),
"Pregnancies": pa.Column(int, pa.Check.between(0, 15), nullable=True),
"Glucose": pa.Column(int, pa.Check.between(50, 240), nullable=True),
"BloodPressure": pa.Column(int, pa.Check.between(40, 180), nullable=True),
"SkinThickness": pa.Column(int, pa.Check.between(0, 80), nullable=True),
"Insulin": pa.Column(int, pa.Check.between(0, 800), nullable=True),
"BMI": pa.Column(float, pa.Check.between(0, 65), nullable=True),
"DiabetesPedigreeFunction": pa.Column(float, pa.Check.between(0, 2.5), nullable=True),
"Age": pa.Column(int, pa.Check.between(18, 90), nullable=True),
},
checks=[
pa.Check(lambda df: ~df.duplicated().any(), error="Duplicate rows found."),
pa.Check(lambda df: ~(df.isna().all(axis=1)).any(), error="Empty rows found.")
],
drop_invalid_rows=False,  # keep invalid rows so they can be logged and dropped explicitly below
)
# Initialize error cases DataFrame
error_cases = pd.DataFrame()
data = df_original.copy()
# Validate data and handle errors
try:
validated_data = schema.validate(data, lazy=True)
except pa.errors.SchemaErrors as e:
error_cases = e.failure_cases
# Convert the error message to a JSON string
error_message = json.dumps(e.message, indent=2)
logging.error("\n" + error_message)
# Filter out invalid rows based on the error cases
if not error_cases.empty:
invalid_indices = error_cases["index"].dropna().unique()
df = (
data.drop(index=invalid_indices)
.reset_index(drop=True)
.drop_duplicates()
.dropna(how="all")
)
else:
df = data
df
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
714 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
715 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
716 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
717 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
718 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
719 rows × 9 columns
# EDA
df.shape
(719, 9)
The info()
method shows that the dataset does not have any features with missing values, and all features are numeric.
# EDA
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 719 entries, 0 to 718
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               719 non-null    int64
 1   Glucose                   719 non-null    int64
 2   BloodPressure             719 non-null    int64
 3   SkinThickness             719 non-null    int64
 4   Insulin                   719 non-null    int64
 5   BMI                       719 non-null    float64
 6   DiabetesPedigreeFunction  719 non-null    float64
 7   Age                       719 non-null    int64
 8   Outcome                   719 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 50.7 KB
Using the train_test_split()
function we will split our data set with 70% going to train the model and 30% going towards testing the model.
# Create the split
train_df, test_df = train_test_split(df,
train_size = 0.7,
random_state=123)
The describe() method shows us the summary statistics of each of our features as well as our target variable. We can see the mean as well as the spread (standard deviation). Using this information and the visualization tools we will see next, we can determine how skewed each feature's values are.
# Explore training data
census_summary = train_df.describe()
census_summary
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
---|---|---|---|---|---|---|---|---|---|
count | 503.000000 | 503.000000 | 503.000000 | 503.000000 | 503.000000 | 503.000000 | 503.000000 | 503.000000 | 503.000000 |
mean | 3.813121 | 121.524851 | 73.025845 | 21.045726 | 81.982107 | 31.940954 | 0.482628 | 33.087475 | 0.328032 |
std | 3.360905 | 30.469809 | 12.250267 | 15.366826 | 113.203347 | 7.218791 | 0.349740 | 11.842942 | 0.469964 |
min | 0.000000 | 56.000000 | 40.000000 | 0.000000 | 0.000000 | 0.000000 | 0.085000 | 21.000000 | 0.000000 |
25% | 1.000000 | 99.500000 | 64.000000 | 0.000000 | 0.000000 | 26.800000 | 0.246500 | 24.000000 | 0.000000 |
50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 44.000000 | 31.600000 | 0.381000 | 29.000000 | 0.000000 |
75% | 6.000000 | 139.000000 | 80.000000 | 32.000000 | 130.000000 | 36.450000 | 0.646500 | 41.000000 | 1.000000 |
max | 14.000000 | 199.000000 | 122.000000 | 63.000000 | 744.000000 | 59.400000 | 2.420000 | 81.000000 | 1.000000 |
# List features
features = census_summary.columns.tolist()
features
['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
# Visualize feature distributions
feature_histograms = alt.Chart(train_df).mark_bar(opacity=0.5).encode(
    x=alt.X(alt.repeat()).type('quantitative').bin(maxbins=30),
    y=alt.Y('count()').stack(False),
    color='Outcome:N'
).properties(
    height=250,
    width=250
).repeat(
    features,
    columns=2
)
feature_histograms
Figure 1. Comparison of the empirical distributions of training data predictors between those non-diabetic and diabetic
Figure 1 above illustrates the distribution of each feature, categorized based on the Outcome variable: 0 (Non-Diabetic) and 1 (Diabetic). This visualization provides insights into the relationships between individual features and the target variable.
For glucose levels, the Non-Diabetic class exhibits a roughly normal distribution, whereas the Diabetic class shows a pronounced shift toward the middle-to-higher range of glucose levels.
The BMI distribution for the Diabetic class resembles a normal distribution but skews slightly toward higher values. Interestingly, the Non-Diabetic class displays a bimodal pattern, suggesting the presence of distinct subgroups within this category.
The Age distribution reveals that individuals aged 20 to 32 are predominantly Non-Diabetic. Beyond age 32, the counts of Diabetic and Non-Diabetic individuals become comparable, with some bins showing a higher count for the Diabetic class, despite fewer overall observations in this group. The Non-Diabetic class leans toward younger ages, while the Diabetic class has a more even distribution across its age range.
For Pregnancies, the lower range is dominated by the Non-Diabetic class, whereas higher numbers of pregnancies are more common in the Diabetic class.
For Skin Thickness, both the Diabetic and Non-Diabetic classes approximate a normal distribution; however, the Non-Diabetic distribution skews slightly toward lower values, while the Diabetic class skews more toward higher values.
# validate training data for class imbalance for target variable
# Do these on training data as part of EDA!
train_df_ds = Dataset(train_df, label = 'Outcome', cat_features=[])
check_lab_cls_imb = ClassImbalance().add_condition_class_ratio_less_than(0.4)
check_lab_cls_imb_result = check_lab_cls_imb.run(dataset = train_df_ds)
if check_lab_cls_imb_result.passed_conditions():
raise ValueError("Class imbalance exceeds the maximum acceptable threshold.")
# validate training data for percent of nulls
check_pct_nulls = PercentOfNulls().add_condition_percent_of_nulls_not_greater_than(0.05)
check_pct_nulls_result = check_pct_nulls.run(dataset = train_df_ds)
if not check_pct_nulls_result.passed_conditions():
raise ValueError("Percent of nulls exceeds the maximum acceptable threshold for at least one column.")
# validate training data for percent of outlier samples using the LoOP algorithm
check_out_sample = (
OutlierSampleDetection(nearest_neighbors_percent = 0.01, extent_parameter = 3)
.add_condition_outlier_ratio_less_or_equal(max_outliers_ratio = 0.001, outlier_score_threshold = 0.9)
)
check_out_sample_result = check_out_sample.run(dataset = train_df_ds)
if not check_out_sample_result.passed_conditions():
raise ValueError("Number of outlier samples exceeds the maximum acceptable threshold.")
# validate training data for data duplicates
# set duplicate condition to 0 as we would not expect any two patients with the exact same measurements
check_data_dup = DataDuplicates().add_condition_ratio_less_or_equal(0)
check_data_dup_result = check_data_dup.run(dataset = train_df_ds)
if not check_data_dup_result.passed_conditions():
raise ValueError("Data duplicates exceed the maximum acceptable threshold.")
# validate training data for mixed data types across all columns
check_mix_dtype = MixedDataTypes().add_condition_rare_type_ratio_not_in_range((0.01, 0.2))
check_mix_dtype_result = check_mix_dtype.run(dataset = train_df_ds)
if not check_mix_dtype_result.passed_conditions():
# raise a warning instead of an error in this case
warnings.warn("Percentage of rare data type in dangerous zone for at least one column")
# Visualize correlations across features
aly.corr(train_df)
Figure 2. Pearson and Spearman correlations across all features
Figure 2 shows the correlation between all of the respective features. The main reason to analyze this is to see whether there is any multicollinearity between features, which could be problematic when conducting a logistic regression. We see that the highest level of correlation is between Age and Pregnancies (0.56 by Pearson and 0.62 by Spearman). Since this is below the threshold of 0.7, we conclude that no feature pair is likely to cause multicollinearity issues in our model.
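As a complementary, non-visual check, the same pairwise correlations can be computed directly with pandas and any pair at or above the 0.7 threshold flagged programmatically. The sketch below is illustrative rather than part of the original pipeline: high_corr_pairs is a hypothetical helper, and it reuses the train_df split and the pd import from above.
# Hypothetical helper (illustrative only): flag feature pairs whose absolute
# correlation meets or exceeds a threshold
def high_corr_pairs(data, threshold=0.7, method="pearson"):
    corr = data.corr(method=method)
    cols = corr.columns
    pairs = [
        (cols[i], cols[j], corr.iloc[i, j])
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if abs(corr.iloc[i, j]) >= threshold
    ]
    return pd.DataFrame(pairs, columns=["feature_1", "feature_2", method])

# Mirror Figure 2 by checking both Pearson and Spearman correlations
feature_cols = train_df.drop(columns=["Outcome"])
high_corr_pairs(feature_cols, threshold=0.7, method="pearson")
high_corr_pairs(feature_cols, threshold=0.7, method="spearman")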
# Visualize relationships
aly.pair(train_df[features].sample(300), color='Outcome:N')
Figure 3. Pairwise scatterplots between each of features in dataset to visualize relationship
Figure 3 illustrates the relationships between the features. For the most part, the features do not display noticeable trends. However, Skin Thickness and BMI show a moderate visual relationship, which is intuitive since higher body mass is generally associated with increased skin thickness.
Referring back to the correlation graph, Skin Thickness and BMI have a Pearson correlation of 0.39. This value is below the multicollinearity threshold of 0.7, indicating that these features do not pose a risk of multicollinearity in our model.
# validate training data for anomalous correlations between target/response variable
# and features/explanatory variables,
# as well as anomalous correlations between features/explanatory variables
check_feat_lab_corr = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.7)
check_feat_lab_corr_result = check_feat_lab_corr.run(dataset = train_df_ds)
check_feat_feat_corr = FeatureFeatureCorrelation().add_condition_max_number_of_pairs_above_threshold(threshold = 0.7, n_pairs = 0)
check_feat_feat_corr_result = check_feat_feat_corr.run(dataset = train_df_ds)
if not check_feat_lab_corr_result.passed_conditions():
raise ValueError("Feature-Label correlation exceeds the maximum acceptable threshold.")
if not check_feat_feat_corr_result.passed_conditions():
raise ValueError("Feature-feature correlation exceeds the maximum acceptable threshold.")
Here we further split our dataset into X and y for both the training and test:
X_train = train_df.drop(columns = ['Outcome'])
y_train = train_df['Outcome']
X_test = test_df.drop(columns = ['Outcome'])
y_test = test_df['Outcome']
The Dummy Classifier acts as our baseline for conducting the initial analysis. The dummy baseline gives us a mean cross-validation score of around 0.672.
# Create Dummy Classifier and cross validation
dummy_clf = DummyClassifier()
mean_cv_score = cross_val_score(dummy_clf,
X_train,
y_train).mean()
mean_cv_score
0.6719603960396039
We will use a Logistic Regression model for classification. Given the differing scales and presence of outliers in our features, it is advisable to apply StandardScaler() to standardize the feature values before fitting the model. This ensures that all features are on a similar scale, improving the model's performance and stability.
# Create Logistic Regression pipeline
log_pipe=make_pipeline(
StandardScaler(),
LogisticRegression(max_iter=2000,random_state=123)
)
We optimize the hyperparameter C
for our Logistic Regression model using a random search approach.
# Hyperparameter optimization
np.random.seed(123)
param_dist = {
"logisticregression__C": [10**i for i in range(-5,15)]
}
# Create Random Search
random_search = RandomizedSearchCV(log_pipe,param_dist,
n_iter=20,
n_jobs=-1,
return_train_score=True,
random_state=123)
random_search.fit(X_train,y_train)
RandomizedSearchCV(estimator=Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression(max_iter=2000, random_state=123))]), n_iter=20, n_jobs=-1, param_distributions={'logisticregression__C': [1e-05, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000, 10000000000, 100000000000, 1000000000000, 10000000000000, 100000000000000]}, random_state=123, return_train_score=True)
Identify the optimal value for the hyperparameter C to be used in our Logistic Regression model.
# Identify optimized parameter C
best_params = random_search.best_params_
best_params
{'logisticregression__C': 10}
# Display scores
pd.DataFrame(random_search.cv_results_).sort_values(
"rank_test_score").head(3)[["mean_test_score",
"mean_train_score"]]
mean_test_score | mean_train_score | |
---|---|---|
9 | 0.775366 | 0.781808 |
17 | 0.775366 | 0.781808 |
16 | 0.775366 | 0.781808 |
Having determined the best Logistic Regression model for our analysis, we further explore feature importance with coefficients.
# Best model from the search
best_model = random_search.best_estimator_
# Retrieve the coefficients and feature names
coefficients = best_model.named_steps['logisticregression'].coef_.flatten()
features = X_train.columns
# Create a DataFrame to display the feature names and corresponding coefficients
coeff_df = pd.DataFrame({
'Features': features,
'Coefficients': coefficients
})
# Sort by 'Coefficients' in descending order to see the most important features first
coeff_df_sorted = coeff_df.sort_values(by = 'Coefficients', ascending = False)
# Create a heatmap for the coefficients (we will visualize them as a single column)
coeff_df_sorted.style.format(
precision = 2
).background_gradient(
axis = None,
cmap = 'RdBu_r',
low = 0
)
Features | Coefficients | |
---|---|---|
1 | Glucose | 1.15 |
5 | BMI | 0.63 |
0 | Pregnancies | 0.35 |
6 | DiabetesPedigreeFunction | 0.22 |
7 | Age | 0.21 |
3 | SkinThickness | -0.02 |
2 | BloodPressure | -0.08 |
4 | Insulin | -0.14 |
Table 1: Logistic regression feature importance measured by coefficients
Based on the heatmap and Table 1 above, the feature importance coefficients for the logistic regression model predicting diabetes reveal that Glucose (1.15) is the strongest positive influence, followed by BMI (0.63) and Pregnancies (0.35). DiabetesPedigreeFunction (0.22) and Age (0.21) have more moderate positive effects, while SkinThickness (-0.02), BloodPressure (-0.08), and Insulin (-0.14) have weak negative coefficients, with their effects being the least pronounced.
We then evaluate the best Logistic Regression model, obtained from the hyperparameter search, on the test set.
# Make predictions using the best model
y_pred = best_model.predict(X_test)
In addition, to enhance the model's practical use in a clinical setting, we report probability estimates for the diabetes predictions. Offering probability estimates allows clinicians to gauge the model's confidence in its predictions, giving them the opportunity to conduct additional diagnostic tests if the predicted probability for the outcome (i.e., the predicted diagnosis) is not sufficiently high.
y_pred_prob = best_model.predict_proba(X_test)
pred_bool = (y_test == y_pred)
pred_results_1 = np.vstack([y_test, y_pred, pred_bool, y_pred_prob[:, 1]])
pred_results_1_df = pd.DataFrame(pred_results_1.T,
columns = ['y_test', 'y_pred', 'pred_bool', 'y_pred_prob_1'])
pred_results_1_df['pred_bool'] = pred_results_1_df['pred_bool'] == 1
pred_results_1_df.head()
y_test | y_pred | pred_bool | y_pred_prob_1 | |
---|---|---|---|---|
0 | 0.0 | 0.0 | True | 0.082355 |
1 | 0.0 | 0.0 | True | 0.303701 |
2 | 0.0 | 0.0 | True | 0.130787 |
3 | 1.0 | 1.0 | True | 0.675814 |
4 | 1.0 | 0.0 | False | 0.451538 |
Our prediction model performed decently on test data, with a final overall accuracy of 0.75. Looking through the prediction results dataframe, there are a total of 54 mistakes. Of these, 35 were diabetic patients predicted as non-diabetic (false negatives) and 19 were non-diabetic patients predicted as diabetic (false positives). Considering implementation in a clinic, there is room for improvement in the algorithm: false negatives are more harmful than false positives, so we should aim to lower false negatives in particular.
# Compute accuracy
accuracy = best_model.score(X_test, y_test)
pd.DataFrame({'accuracy': [accuracy]})
accuracy | |
---|---|
0 | 0.75 |
# Calculate the number of correct predictions and misclassifications
value_counts = pred_results_1_df['pred_bool'].value_counts()
pd.DataFrame({
'correct predictions': [value_counts.get(True, 0)],
'misclassifications': [value_counts.get(False, 0)]
})
correct predictions | misclassifications | |
---|---|---|
0 | 162 | 54 |
# Calculate the number of false positives (FPs) and false negatives (FNs)
fp = len(pred_results_1_df[(pred_results_1_df['y_test'] == 0) & (pred_results_1_df['y_pred'] == 1)])
fn = len(pred_results_1_df[(pred_results_1_df['y_test'] == 1) & (pred_results_1_df['y_pred'] == 0)])
pd.DataFrame({
'false positives': [fp],
'false negatives': [fn]
})
false positives | false negatives | |
---|---|---|
0 | 19 | 35 |
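The same counts can be cross-checked with scikit-learn's built-in metrics. This is a minimal sketch rather than part of the original analysis; it assumes the y_test and y_pred objects defined above.
from sklearn.metrics import confusion_matrix, classification_report

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_test, y_pred))

# Per-class precision and recall make the false-negative problem explicit,
# since recall for the positive class is TP / (TP + FN)
print(classification_report(y_test, y_pred, target_names=["Non-Diabetic", "Diabetic"]))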
Moreover, visualizing prediction probabilities alongside the prediction accuracy for each test sample provides a clearer understanding of the model's performance. This approach allows us to easily assess how well the model predicts, while also highlighting patients who were misdiagnosed. Particularly, it helps us focus on false negatives, as the consequences of these errors are more critical in a clinical context.
alt.Chart(pred_results_1_df, title = 'Test Set Prediction Accuracy').mark_tick().encode(
x = alt.X('y_pred_prob_1').title('Positive Class Prediction Prob'),
y = alt.Y('pred_bool').title('Pred. Accuracy'),
color = alt.Color('y_test:N').title('Outcome')
)
Figure 4. Test Set Prediction Accuracy by Prediction Probability
Discussion¶
While the performance of this model may be valuable as a screening tool in a clinical context, especially given its improvements over the baseline, there are several opportunities for further enhancement. One potential approach is to closely examine the 54 misclassified observations, comparing them with correctly classified examples from both classes. The objective would be to identify which features may be contributing to the misclassifications and investigate whether feature engineering could help the model improve its predictions on the observations it currently struggles with. Additionally, we could test whether other classifiers yield improved predictions, for example: 1) random forest, because it automatically allows for feature interactions; 2) k-nearest neighbours (k-NN), which usually provides easily interpretable and decent predictions; and 3) a support vector classifier (SVC), as it allows for non-linear prediction using the RBF kernel (see the sketch below). It is also possible that the features offered by this dataset alone are not sufficient to predict with high accuracy. In that case, it might be beneficial to consult the data collectors for additional usable information, or to explore additional datasets that could be joined so that our set of features can be expanded for a more sophisticated analysis.
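As a starting point for that comparison, the sketch below benchmarks the candidate models with default hyperparameters under the same cross-validation setup used earlier. It is illustrative only: it reuses X_train, y_train, make_pipeline, StandardScaler, and cross_val_score from above, and each model would still need its own hyperparameter tuning before drawing conclusions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Candidate models from the discussion above; k-NN and SVC are scaled like
# logistic regression, while random forest does not require scaling
candidate_models = {
    "random forest": RandomForestClassifier(random_state=123),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVC (rbf)": make_pipeline(StandardScaler(), SVC(random_state=123)),
}

for name, model in candidate_models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")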
Finally, we recognize a limitation of this dataset: it focuses solely on Pima Indian women aged 21 and older, which limits its generalizability to other populations. To improve the analysis, it would be valuable to combine this data with other datasets representing different age groups, genders, and ethnicities, enabling more comprehensive insights and broader applicability of the findings.
Conclusion¶
In conclusion, this study demonstrated the effectiveness of logistic regression in predicting diabetes among Pima Indian women using diagnostic features such as glucose, BMI, and pregnancies. With an accuracy of 75% on the test set, the model outperformed the baseline Dummy Classifier's 67.2%. Glucose was identified as the most influential predictor, followed by BMI and pregnancies, while features like blood pressure and insulin had weaker impacts. However, the model's 54 misclassifications, including 35 false negatives, underscore the need for further refinement to minimize the risk of undiagnosed cases.
These findings highlight logistic regression's potential as an initial screening tool in clinical settings, offering a data-driven approach to early diabetes detection. Nevertheless, improvements are essential to enhance its accuracy and practical utility. Strategies such as feature engineering, alternative machine learning models, and the incorporation of additional data, such as lifestyle or genetic factors, could further optimize performance. Additionally, providing probability estimates for predictions could enhance clinical decision-making by identifying cases requiring further diagnostics. With these refinements, the model could become a valuable tool for reducing complications and improving outcomes in diabetes care.
References¶
Agarwal, N., & Vadiwala, R. (2022). Machine Learning and Data Mining Methods in Diabetes Research. Asian Journal of Organic & Medicinal Chemistry.
Battineni, G., Sagaro, G. G., Chinatalapudi, N., & Amenta, F. (2020). Applications of machine learning predictive models in the chronic disease diagnosis. Journal of personalized medicine, 10(2), 21.
Bini, S. A. (2018). Artificial intelligence, machine learning, deep learning, and cognitive computing: what do these terms mean and how will they impact health care?. The Journal of arthroplasty, 33(8), 2358-2361.
Dua, D., & Graff, C. (2017). Pima Indians Diabetes Database. UCI Machine Learning Repository. Retrieved from https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database/data
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., ... & Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2
McKinney, W. (2010). Data structures for statistical computing in Python. In S. van der Walt & J. Millman (Eds.), Proceedings of the 9th Python in Science Conference (pp. 51–56).
Marshall, S. M., & Flyvbjerg, A. (2006). Prevention and early detection of vascular complications of diabetes. Bmj, 333(7566), 475-480.
Ostblom, J. (2021). altair_ally: Enhancing Altair for statistical visualization. Retrieved from https://github.com/jostblom/altair_ally
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(October), 2825–2830.
VanderPlas, J., Granger, B., Heer, J., Moritz, D., Wongsuphasawat, K., Satyanarayan, A., ... & Sievert, S. (2018). Altair: Interactive statistical visualizations for Python. Journal of Open Source Software, 3(32), 1057. https://doi.org/10.21105/joss.01057
Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. Scotts Valley, CA: CreateSpace.
World Health Organization. (n.d.). Diabetes. Retrieved November 22, 2024, from https://www.who.int/news-room/fact-sheets/detail/diabetes