Applying Keras Neural Networks in Healthcare: Diabetes Detection
Code
# Import needed packages: pandas/numpy for data handling, scikit-learn for
# preprocessing, modeling, and metrics, XGBoost for gradient boosting, and
# TensorFlow/Keras for the neural networks
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier, AdaBoostClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from xgboost import XGBClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
import matplotlib.pyplot as plt
import seaborn as sn
1 Introduction
Neural networks are a fundamental aspect of deep learning, capable of modeling complex relationships in data. In this analysis, we aimed to predict the presence of diabetes using data from the Centers for Disease Control and Prevention (CDC), specifically the Behavioral Risk Factor Surveillance System (BRFSS). Our primary goal was to build and evaluate various neural network models, determine their effectiveness at binary classification for predicting diabetes, and compare their performance to the best-performing methods we have used previously. To allow a controlled comparison of the differing models' usage and performance metrics, I chose a balanced dataset with a 50/50 split of diabetic and non-diabetic cases, which avoids giving an unfair advantage to methods that are well-suited to handling imbalanced data, like XGBoost.
2 Data Preparation
Code
# Read-in the data, naming our df in context
diabetes_data = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/archive (4)/diabetes_binary_5050split_health_indicators_BRFSS2015.csv")

# Check for missing values before modeling, displaying no issues
diabetes_data.isnull().sum()

# Renaming a few variables to be more descriptive of the column
diabetes_data.rename(columns={'MentHlth': 'PoorGenHealth_Days', 'PhysHlth': 'PoorPhysHealth_Days',
                              'GenHlth': 'GeneralHealth_Rating'}, inplace=True)

# Define column transformer to be used for preprocessing
ct_all = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
         make_column_selector(dtype_include=object)),
        ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))
    ],
    remainder="passthrough"
).set_output(transform="pandas")

# View head of the data to prepare for modeling
diabetes_data.head(5)
Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GeneralHealth_Rating | PoorGenHealth_Days | PoorPhysHealth_Days | DiffWalk | Sex | Age | Education | Income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 1.0 | 0.0 | 1.0 | 26.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 3.0 | 5.0 | 30.0 | 0.0 | 1.0 | 4.0 | 6.0 | 8.0 |
1 | 0.0 | 1.0 | 1.0 | 1.0 | 26.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 12.0 | 6.0 | 8.0 |
2 | 0.0 | 0.0 | 0.0 | 1.0 | 26.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 10.0 | 0.0 | 1.0 | 13.0 | 6.0 | 8.0 |
3 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 1.0 | 11.0 | 6.0 | 8.0 |
4 | 0.0 | 0.0 | 0.0 | 1.0 | 29.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 5.0 | 8.0 |
5 rows × 22 columns
This dataset required minimal cleaning or pre-processing due to its completeness and perfectly balanced target variable, Diabetes_binary, which allowed me to focus on understanding the effects of various features within our models. Before modeling, I sought to contextualize the growing issue of diabetes in the US, aiming to confirm hypotheses and previous findings with current data. Given the large potential feature set, it is optimal to identify and remove features that have no effect on a person's diabetes status. To research this, I visited the CDC's page on diabetes risk factors to confirm the key influences on diabetes. The CDC warns that diabetes is more prevalent among elderly and overweight individuals, and those with a family history of the disease (not captured in our variables). The relevant predictors in our dataset include Age, HighChol, HighBP, PhysActivity, BMI, and GeneralHealth_Rating.
To understand these factors' true predictive effect on whether a patient has diabetes, we can leverage an initial logistic regression model, which returns the slope, or overall impact, of each predictor on diabetes status. Logistic regression relies on probabilities to predict a case of diabetes as 1 or 0, based on a likelihood cutoff that is typically 0.50. Alongside this logistic model, we created a few interaction terms to test for additional significant relationships between variables like Income and HighBP, since it could be expected that high-income individuals have more money to spend on healthy food and exercise memberships, and therefore better overall health (and less risk of diabetes).
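As a quick illustration of that cutoff, the sketch below uses hypothetical toy data (not the BRFSS features) to show that thresholding predict_proba at 0.50 reproduces scikit-learn's default predict behavior:

Code

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the diabetes features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + rng.normal(size=100) > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# predict_proba returns P(class 0) and P(class 1); applying the usual
# 0.50 cutoff to P(class 1) matches what clf.predict() returns
probs = clf.predict_proba(X_demo)[:, 1]
manual_preds = (probs >= 0.50).astype(int)
print((manual_preds == clf.predict(X_demo)).all())  # True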
3 Logistic Regression
Code
# Drop 'Diabetes_binary' as the target column
X = diabetes_data.drop('Diabetes_binary', axis=1)
y = diabetes_data['Diabetes_binary']

# Create interaction terms
diabetes_data['Income_BMI'] = diabetes_data['Income'] * diabetes_data['BMI']
diabetes_data['Income_HighBP'] = diabetes_data['Income'] * diabetes_data['HighBP']
diabetes_data['Income_HighChol'] = diabetes_data['Income'] * diabetes_data['HighChol']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# Using ct_all which is already defined
logistic_regression_pipeline = Pipeline([
    ("preprocessor", ct_all),
    ("logisticregressor", LogisticRegression(max_iter=1000, solver='lbfgs'))
])

# Define the parameter grid
param_grid = {
    'logisticregressor__C': [0.1, 1, 10],
    'logisticregressor__penalty': ['l2']
}

# Tuning with GridSearchCV, using 'accuracy' as the scoring metric for classification
grid_search = GridSearchCV(logistic_regression_pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best parameters: ", grid_search.best_params_)
print("Best accuracy: ", grid_search.best_score_)

# Re-predict on the test set
test_predictions = grid_search.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print("Test set accuracy: ", test_accuracy)
Best parameters: {'logisticregressor__C': 1, 'logisticregressor__penalty': 'l2'}
Best accuracy: 0.7463018708706141
Test set accuracy: 0.7506601282534893
4 Feature Effect Interpretation
Code
import statsmodels.api as sm

# Fit the logistic regression model using the statsmodels Logit function
model = sm.Logit(y_train, X_train).fit()

# Print the summary of the model to see the coefficients and p-values
model.summary()
Optimization terminated successfully.
Current function value: 0.539374
Iterations 6
Dep. Variable: | Diabetes_binary | No. Observations: | 49484 |
---|---|---|---|
Model: | Logit | Df Residuals: | 49463 |
Method: | MLE | Df Model: | 20 |
Date: | Mon, 28 Oct 2024 | Pseudo R-squ.: | 0.2218 |
Time: | 12:38:59 | Log-Likelihood: | -26690. |
converged: | True | LL-Null: | -34300. |
Covariance Type: | nonrobust | LLR p-value: | 0.000 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
HighBP | 0.8583 | 0.023 | 37.214 | 0.000 | 0.813 | 0.904 |
HighChol | 0.6189 | 0.022 | 28.045 | 0.000 | 0.576 | 0.662 |
CholCheck | -0.7648 | 0.061 | -12.589 | 0.000 | -0.884 | -0.646 |
BMI | 0.0347 | 0.002 | 22.392 | 0.000 | 0.032 | 0.038 |
Smoker | -0.1245 | 0.022 | -5.703 | 0.000 | -0.167 | -0.082 |
Stroke | 0.1591 | 0.049 | 3.264 | 0.001 | 0.064 | 0.255 |
HeartDiseaseorAttack | 0.3890 | 0.034 | 11.588 | 0.000 | 0.323 | 0.455 |
PhysActivity | -0.2097 | 0.024 | -8.583 | 0.000 | -0.258 | -0.162 |
Fruits | -0.0784 | 0.023 | -3.460 | 0.001 | -0.123 | -0.034 |
Veggies | -0.1824 | 0.027 | -6.728 | 0.000 | -0.236 | -0.129 |
HvyAlcoholConsump | -0.8665 | 0.057 | -15.252 | 0.000 | -0.978 | -0.755 |
AnyHealthcare | -0.7209 | 0.053 | -13.631 | 0.000 | -0.825 | -0.617 |
NoDocbcCost | -0.3376 | 0.039 | -8.760 | 0.000 | -0.413 | -0.262 |
GeneralHealth_Rating | 0.3827 | 0.013 | 30.320 | 0.000 | 0.358 | 0.407 |
PoorGenHealth_Days | -0.0103 | 0.001 | -6.904 | 0.000 | -0.013 | -0.007 |
PoorPhysHealth_Days | -0.0016 | 0.001 | -1.184 | 0.237 | -0.004 | 0.001 |
DiffWalk | 0.2159 | 0.030 | 7.145 | 0.000 | 0.157 | 0.275 |
Sex | 0.1762 | 0.022 | 7.965 | 0.000 | 0.133 | 0.220 |
Age | 0.0721 | 0.004 | 17.197 | 0.000 | 0.064 | 0.080 |
Education | -0.2620 | 0.011 | -23.552 | 0.000 | -0.284 | -0.240 |
Income | -0.0836 | 0.006 | -13.961 | 0.000 | -0.095 | -0.072 |
To identify the factors associated with diabetes, we can look into the summary results of our logistic regression model, a method typically utilized when predicting a binary (two-category) target variable. The statsmodels package in Python gives easy access to logistic regression summary statistics, providing information beyond simple slope coefficients, such as p-values and confidence intervals. Based on these predictive effects, we can see that nearly every predictor has a significant impact on diabetes status when applying an alpha level, or significance threshold, of 0.05. After running a few different iterations, however, I noticed slightly differing results in terms of statistical significance, with the variables CholCheck and PoorPhysHealth_Days showing an insignificant impact at times. To account for this variability, I elected to use a stricter significance threshold of 0.01 rather than 0.05, which warranted removing these unstable variables with p-values of 0.046 and 0.03. As for our tested interactions, all of them produced statistically significant p-values, indicating that they contribute meaningfully to the model beyond the main effects of the individual variables.
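As a sketch of that filtering step (assuming the fitted statsmodels result model from the code above is still in scope), the .pvalues attribute makes the stricter cutoff easy to apply programmatically:

Code

# List predictors that fail the stricter 0.01 cutoff; the fitted result's
# .pvalues is a pandas Series indexed by predictor name
insignificant = model.pvalues[model.pvalues > 0.01]
print(insignificant)

# Those columns can then be dropped before refitting
X_reduced = X_train.drop(columns=insignificant.index)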
However, it is worth interpreting the slope coefficients of our interactions, to ensure that we can explain and understand the underlying impact of each interaction on our target variable, diabetes. The positive coefficients for all three interaction terms, Income_BMI, Income_HighBP, and Income_HighChol, indicate that as both income and BMI (or blood pressure, or cholesterol) increase, the combined effect on the likelihood of diabetes increases. This may seem counterintuitive given our earlier hypothesis; however, several factors could explain the relationship. First, higher-income individuals might have better access to healthcare and are therefore more likely to be diagnosed with diabetes if they have high BMI, cholesterol, or blood pressure. Additionally, higher income does not necessarily equate to healthier lifestyle choices, as people with higher incomes might have more opportunities to indulge in behaviors that contribute to higher BMI, cholesterol, or blood pressure, such as eating out frequently or consuming high-calorie foods. Now that we understand the significant factors that contribute to diabetes, we can move into modeling with more complex solutions like Keras neural networks and stacking models to improve on our accuracy of 75%.
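Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to communicate. A minimal sketch, again assuming the fitted statsmodels result model from above:

Code

# Convert log-odds coefficients to odds ratios for interpretation
odds_ratios = np.exp(model.params)
print(odds_ratios.sort_values(ascending=False).head())

# For example, HighBP's coefficient of 0.8583 corresponds to
# exp(0.8583) ≈ 2.36: high blood pressure multiplies the odds of
# diabetes by roughly 2.4, holding the other predictors fixed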
5 Model 1: Keras Neural Network
Code
# Step 1: Preprocess the data
# Drop 'Diabetes_binary' due to it being our target column & drop insignificant predictors
X = diabetes_data.drop(['Diabetes_binary', 'PoorPhysHealth_Days', 'CholCheck'], axis=1)
y = diabetes_data['Diabetes_binary']

# Normalize features using a standard scaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Ensure y_train and y_test are NumPy arrays for Keras to be able to fit our model
y2_train = np.array(y_train).astype(np.float32).ravel()  # Flatten to 1D
y2_test = np.array(y_test).astype(np.float32).ravel()

# Specify the number of features, which will be used for the input shape
number_of_features = len(X.columns)

# Step 2: Build the model
model = Sequential([
    Input(shape=(number_of_features,)),
    Dense(12, activation='relu'),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Step 3: Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Step 4: Train the model
model.fit(X_train, y2_train, epochs=20, batch_size=10, verbose=1)

# Step 5: Evaluate the model
loss, accuracy = model.evaluate(X_test, y2_test)
print('Test accuracy:', accuracy)
Epoch 1/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 268us/step - accuracy: 0.7180 - loss: 0.5521
Epoch 2/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 257us/step - accuracy: 0.7479 - loss: 0.5102
Epoch 3/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 261us/step - accuracy: 0.7484 - loss: 0.5078
Epoch 4/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 261us/step - accuracy: 0.7508 - loss: 0.5078
Epoch 5/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 262us/step - accuracy: 0.7521 - loss: 0.5053
Epoch 6/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 258us/step - accuracy: 0.7533 - loss: 0.5048
Epoch 7/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 257us/step - accuracy: 0.7554 - loss: 0.5033
Epoch 8/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 260us/step - accuracy: 0.7565 - loss: 0.5008
Epoch 9/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 259us/step - accuracy: 0.7533 - loss: 0.5027
Epoch 10/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 260us/step - accuracy: 0.7527 - loss: 0.5030
Epoch 11/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 274us/step - accuracy: 0.7547 - loss: 0.5005
Epoch 12/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 268us/step - accuracy: 0.7544 - loss: 0.5005
Epoch 13/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 268us/step - accuracy: 0.7528 - loss: 0.5051
Epoch 14/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 262us/step - accuracy: 0.7505 - loss: 0.5042
Epoch 15/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 264us/step - accuracy: 0.7564 - loss: 0.4993
Epoch 16/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 272us/step - accuracy: 0.7536 - loss: 0.5015
Epoch 17/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 266us/step - accuracy: 0.7559 - loss: 0.4988
Epoch 18/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 263us/step - accuracy: 0.7563 - loss: 0.4998
Epoch 19/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 262us/step - accuracy: 0.7530 - loss: 0.5044
Epoch 20/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 261us/step - accuracy: 0.7551 - loss: 0.5013
663/663 ━━━━━━━━━━━━━━━━━━━━ 0s 234us/step - accuracy: 0.7552 - loss: 0.5011
Test accuracy: 0.7523576021194458
6 Model 2: Keras Neural Network - Feature Selection & Tuning
Code
# Specify the number of features, which will be used for the input shape
number_of_features = len(X.columns)

# Step 2: Build the model
model = Sequential([
    Input(shape=(number_of_features,)),
    Dense(64, activation='relu'),   # Increased neurons
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.3),                   # Dropout used to apply regularization
    Dense(16, activation='relu'),
    Dropout(0.3),                   # Added more dense layers
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Step 3: Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Step 4: Train the model
model.fit(X_train, y_train, epochs=20, batch_size=10, verbose=1)

# Step 5: Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print('Test accuracy:', accuracy)
Epoch 1/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 371us/step - accuracy: 0.7087 - loss: 0.5708
Epoch 2/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 343us/step - accuracy: 0.7474 - loss: 0.5227
Epoch 3/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 346us/step - accuracy: 0.7469 - loss: 0.5198
Epoch 4/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 344us/step - accuracy: 0.7431 - loss: 0.5169
Epoch 5/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 366us/step - accuracy: 0.7423 - loss: 0.5207
Epoch 6/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 346us/step - accuracy: 0.7473 - loss: 0.5121
Epoch 7/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 347us/step - accuracy: 0.7531 - loss: 0.5107
Epoch 8/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 349us/step - accuracy: 0.7468 - loss: 0.5139
Epoch 9/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 352us/step - accuracy: 0.7487 - loss: 0.5143
Epoch 10/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 360us/step - accuracy: 0.7537 - loss: 0.5099
Epoch 11/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 342us/step - accuracy: 0.7490 - loss: 0.5127
Epoch 12/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 347us/step - accuracy: 0.7489 - loss: 0.5128
Epoch 13/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 342us/step - accuracy: 0.7490 - loss: 0.5131
Epoch 14/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 344us/step - accuracy: 0.7533 - loss: 0.5072
Epoch 15/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 345us/step - accuracy: 0.7526 - loss: 0.5086
Epoch 16/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 345us/step - accuracy: 0.7488 - loss: 0.5105
Epoch 17/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 343us/step - accuracy: 0.7538 - loss: 0.5069
Epoch 18/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 345us/step - accuracy: 0.7567 - loss: 0.5034
Epoch 19/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 347us/step - accuracy: 0.7492 - loss: 0.5117
Epoch 20/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 353us/step - accuracy: 0.7477 - loss: 0.5097
663/663 ━━━━━━━━━━━━━━━━━━━━ 0s 242us/step - accuracy: 0.7539 - loss: 0.5106
Test accuracy: 0.7512730956077576
Following the removal of the insignificant variables found during exploratory logistic regression, and the addition of more neural network layers, parameters, and dropout regularization, our accuracy still held at roughly 75%, showing almost no change and suggesting a performance ceiling for this feature set. Even so, accuracy above 70% on practical, real-world data suggests a highly useful model, with many potential applications in hospital, personal, and even insurance contexts.
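One way to probe this apparent ceiling, sketched below under the assumption that the Model 2 objects (model, X_train, y_train) are still in scope, is to retrain with a held-out validation split and an early-stopping callback, letting Keras halt once validation loss stops improving:

Code

from tensorflow.keras.callbacks import EarlyStopping

# Stop training once validation loss fails to improve for 3 consecutive
# epochs, restoring the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

model.fit(X_train, y_train,
          validation_split=0.2,
          epochs=100, batch_size=10,
          callbacks=[early_stop], verbose=1)

If validation loss bottoms out after only a few epochs, as the flat training curves above suggest it would, the plateau reflects the feature set rather than the 20-epoch training budget.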
7 Model 3 - Stacked Model
- Base Models - XGBoost, Gradient Boosting, AdaBoost, MLPClassifier, Random Forest
- Final Meta-Model - Logistic Regression
In this ensemble approach, the base models (XGBoost, Gradient Boosting, AdaBoost, MLPClassifier, and Random Forest) each contribute initial predictions. These models were selected for their diverse strengths in handling different types of data and capturing various patterns. The predictions from the base models are then used as inputs to the final meta-model, in this case logistic regression, which synthesizes them into a final prediction, leveraging the combined knowledge of the base models to enhance overall performance and robustness.
Code
# Drop 'Diabetes_binary' due to it being our target column & drop insignificant predictors
X = diabetes_data.drop(['Diabetes_binary', 'PoorPhysHealth_Days', 'CholCheck'], axis=1)
y = diabetes_data['Diabetes_binary']

# Define column transformer to be used for preprocessing
ct_all = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
         make_column_selector(dtype_include=object)),
        ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))
    ],
    remainder="passthrough"
).set_output(transform="pandas")

# Apply the column transformer to the entire dataset
X_preprocessed = ct_all.fit_transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.3, random_state=42)

# Define base models within the preprocessing pipeline, with parameters tuned to the usual best performers
base_models = [
    ('gradient_boosting', Pipeline([
        ('classifier', GradientBoostingClassifier(learning_rate=0.1, max_depth=4, n_estimators=300))
    ])),
    ('xgbboost', Pipeline([
        ('classifier', XGBClassifier(learning_rate=0.3, max_depth=4, n_estimators=300))
    ])),
    ('mlp_classifier', Pipeline([
        ('preprocessor', ct_all),
        ('classifier', MLPClassifier(activation='relu', hidden_layer_sizes=(100,),
                                     learning_rate='constant', solver='adam', max_iter=1000,
                                     early_stopping=True))
    ])),
    ('ada_boost', Pipeline([
        ('classifier', AdaBoostClassifier(algorithm='SAMME', n_estimators=300,
                                          learning_rate=0.1))
    ])),
    ('random_forest', Pipeline([
        ('classifier', RandomForestClassifier(max_depth=None, min_samples_leaf=1,
                                              n_estimators=150))
    ]))
]

# Define the final estimator, typically Linear or Logistic; for classification, use Logistic
final_regressor = LogisticRegression()

# Create the stacking ensemble
stacking_classifier = StackingClassifier(estimators=base_models, final_estimator=final_regressor, cv=5)

# Fit the stacking model on train data
stacking_classifier.fit(X_train, y_train)

# Predict/evaluate the model on test data
stacking_predictions = stacking_classifier.predict(X_test)
stacking_accuracy = accuracy_score(y_test, stacking_predictions)

# Stacking model performance including final feature set
print("Stacking Model Test Accuracy:", stacking_accuracy)
Stacking Model Test Accuracy: 0.7531591852131271
Code
import os

# Generate the confusion matrix, using labels from the classifier
labels = stacking_classifier.classes_
cm = confusion_matrix(y_test, stacking_predictions, labels=labels)

# Ensure the output directory exists
output_dir = "new_files/figure-html"
os.makedirs(output_dir, exist_ok=True)

# Plotting the confusion matrix using Seaborn
plt.figure(figsize=(10, 7))
sn.heatmap(cm, annot=True, cmap='Blues', cbar=True, fmt='d',
           xticklabels=labels, yticklabels=labels)
plt.title('Confusion Matrix for Stacking Classifier')
plt.xlabel('Predicted Labels')
plt.ylabel('Actual Labels')

# Save the plot to the correct folder
plt.savefig(f"{output_dir}/cell-9-output-1.png", bbox_inches='tight')
plt.show()
8 Discussion of Results
It's no surprise that the stacking model achieved the highest test accuracy, at 75.32%, while the Keras neural network and logistic regression models came in closely behind at 75.24% and 75.07%. However, this was far closer in performance than we typically see between methods, and in this case the final meta-model was unable to learn significantly more than even a simple logistic regression. These consistent results were expected given the clean and well-preprocessed data, allowing every model to effectively capture the underlying patterns related to diabetes prediction. When examining the factors associated with diabetes, the logistic regression model revealed some interesting insights. Factors positively associated with diabetes include high blood pressure, high cholesterol, BMI, stroke history, heart disease, a poor general health rating, difficulty walking, being male, and older age, nearly all of which are intuitive aside from gender. On the flip side, the variables negatively associated with diabetes are physical activity, fruit and vegetable consumption, access to healthcare, education, and higher income. The remaining factors in the model, although statistically significant, likely have little practical effect on diabetes status, whereas the features discussed above also had large z-scores, meaning their effects are substantial. It was surprising to find that being a smoker showed, if anything, a slightly negative association with diabetes status, along with poor physical health days, which could be random or self-inflicted.
Once again, the interaction terms (Income_BMI, Income_HighBP, and Income_HighChol) showed a positive association with diabetes, which suggests that as both income and BMI (or blood pressure, or cholesterol) increase, the likelihood of diabetes also rises. This might seem counterintuitive initially, but it makes sense when you consider that higher-income individuals might have better access to healthcare, leading to more frequent diabetes diagnoses when they have high BMI, cholesterol, or blood pressure. Additionally, people with higher incomes might indulge more in eating out and consuming high-calorie foods, contributing to higher BMI, cholesterol, or blood pressure, and thus increasing their risk of diabetes. Finally, inspecting the final confusion matrix of our stacking model predictions lets us visualize our success: we correctly predicted 8,405 diabetic cases and 7,572 non-diabetics. In terms of the cost of misclassification, our model aims to minimize false negatives, in which we predict someone to be clear of diabetes risk when they will eventually develop it, or already have it and require treatment. In this case we have 2,202 false negatives, which is a large count, but in context it is the smallest of the four prediction groups (TP, FP, TN, FN), which is a good sign. We incorrectly predicted 3,029 people to have diabetes when they did not; however, this likely means these individuals share traits with diabetics, indicating that they should be monitored and adopt the controllable protective behaviors discussed above.
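Because false negatives carry the highest clinical cost here, one follow-up worth sketching (assuming stacking_classifier, X_test, and y_test from above are still in scope) is to lower the decision threshold below the default 0.50, trading some additional false positives for fewer missed diabetics:

Code

# Probability of the positive (diabetic) class from the meta-model
probs = stacking_classifier.predict_proba(X_test)[:, 1]

# Lowering the threshold flags more borderline patients as diabetic,
# shrinking false negatives at the expense of false positives
for threshold in (0.50, 0.40, 0.30):
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    print(f"threshold={threshold:.2f}  false negatives={fn}  false positives={fp}")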