Applying Keras Neural Networks in Healthcare: Diabetes Detection

Author

Riley Svensson

Code
# Import the packages used throughout this analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

# scikit-learn: preprocessing, model selection, metrics, and ensemble models
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier, AdaBoostClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Gradient-boosted trees
from xgboost import XGBClassifier

# Keras (TensorFlow) layers and models for the neural networks
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

1 Introduction

Neural networks are a fundamental aspect of deep learning, capable of modeling complex relationships in data. In this analysis, we aimed to predict the presence of diabetes using data from the Centers for Disease Control and Prevention (CDC), specifically the Behavioral Risk Factor Surveillance System (BRFSS). Our primary goal was to build and evaluate several neural network models, determine their effectiveness at binary classification of diabetes status, and compare their performance against the best-performing methods we have used previously. To allow a controlled comparison of the two kinds of models in both usage and performance metrics, I chose a balanced dataset with 50% of cases having diabetes and 50% not, which avoids giving an unfair advantage to methods, such as XGBoost, that are well-suited to handling imbalanced data.

2 Data Preparation

Code
# Read in the data, naming our df in context
diabetes_data = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/archive (4)/diabetes_binary_5050split_health_indicators_BRFSS2015.csv") 

# Check for missing values before modeling, displaying no issues 
diabetes_data.isnull().sum()

# Rename a few variables to be more descriptive of their contents
diabetes_data.rename(columns={ 'MentHlth': 'PoorGenHealth_Days', 'PhysHlth': 'PoorPhysHealth_Days',
                                'GenHlth': 'GeneralHealth_Rating'}, inplace=True)

# Define column transformer to be used for preprocessing 
ct_all = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output = False, handle_unknown='ignore'),
    make_column_selector(dtype_include=object)),
    ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))
  ],
  remainder = "passthrough" 
).set_output(transform = "pandas")

# View head of the data to prepare for modeling
diabetes_data.head(5)
Diabetes_binary HighBP HighChol CholCheck BMI Smoker Stroke HeartDiseaseorAttack PhysActivity Fruits ... AnyHealthcare NoDocbcCost GeneralHealth_Rating PoorGenHealth_Days PoorPhysHealth_Days DiffWalk Sex Age Education Income
0 0.0 1.0 0.0 1.0 26.0 0.0 0.0 0.0 1.0 0.0 ... 1.0 0.0 3.0 5.0 30.0 0.0 1.0 4.0 6.0 8.0
1 0.0 1.0 1.0 1.0 26.0 1.0 1.0 0.0 0.0 1.0 ... 1.0 0.0 3.0 0.0 0.0 0.0 1.0 12.0 6.0 8.0
2 0.0 0.0 0.0 1.0 26.0 0.0 0.0 0.0 1.0 1.0 ... 1.0 0.0 1.0 0.0 10.0 0.0 1.0 13.0 6.0 8.0
3 0.0 1.0 1.0 1.0 28.0 1.0 0.0 0.0 1.0 1.0 ... 1.0 0.0 3.0 0.0 3.0 0.0 1.0 11.0 6.0 8.0
4 0.0 0.0 0.0 1.0 29.0 1.0 0.0 0.0 1.0 1.0 ... 1.0 0.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 8.0

5 rows × 22 columns


This dataset required minimal cleaning or pre-processing due to its completeness and perfectly balanced target variable, Diabetes_binary, which allowed me to focus on understanding the effects of various features within our models. Before modeling, I sought to contextualize the growing issue of diabetes in the US, aiming to confirm hypotheses and previous findings with current data. Given the large potential feature set, it is best to identify and remove features that have no effect on a person's diabetes status. To research this, I visited the CDC's page on diabetes risk factors to confirm the key influences on diabetes. The CDC warns that diabetes is more prevalent among elderly and overweight individuals, and among those with a family history of the disease (not captured in our variables). The relevant predictors in our dataset include Age, HighChol, HighBP, PhysActivity, BMI, and GeneralHealth_Rating.
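As a quick sanity check on those claims, both the completeness and the 50/50 balance of the target can be verified in two lines; a minimal sketch, assuming diabetes_data has been read in as above:

Code
# Confirm the perfectly balanced target: expect 0.5 / 0.5
print(diabetes_data['Diabetes_binary'].value_counts(normalize=True))

# Confirm completeness: total count of missing values across all columns (expect 0)
print(diabetes_data.isnull().sum().sum())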

To understand these factors' true predictive effect on whether a patient has diabetes, we can leverage an initial logistic regression model, which returns the slope, or overall impact, that each predictor has on diabetes status. Logistic regression predicts a case of diabetes as 1 or 0 based on an estimated probability and a likelihood cutoff, typically 0.50. Along with this logistic model, we created a few interaction terms to test for additional significant relationships between variables such as Income and HighBP, since it could be expected that high-income individuals have more money to spend on healthy food and exercise memberships, and therefore better overall health (lower risk of diabetes).
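To make the 0.50 cutoff concrete, here is a minimal sketch of how predicted probabilities become class labels; it assumes the fitted grid_search object from the next section, whose .predict() applies this same threshold internally:

Code
# Estimated probability of the positive class (diabetes = 1) for each test case
probs = grid_search.predict_proba(X_test)[:, 1]

# Apply the standard 0.50 cutoff to convert probabilities into 0/1 predictions
preds = (probs >= 0.50).astype(int)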


3 Logistic Regression

Code
# Create the interaction terms first, so they are included among the predictors
diabetes_data['Income_BMI'] = diabetes_data['Income'] * diabetes_data['BMI']
diabetes_data['Income_HighBP'] = diabetes_data['Income'] * diabetes_data['HighBP']
diabetes_data['Income_HighChol'] = diabetes_data['Income'] * diabetes_data['HighChol']

# Drop 'Diabetes_binary' as the target column
X = diabetes_data.drop('Diabetes_binary', axis=1)
y = diabetes_data['Diabetes_binary']


# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# Using ct_all which is already defined
logistic_regression_pipeline = Pipeline([
    ("preprocessor", ct_all),
    ("logisticregressor", LogisticRegression(max_iter=1000, solver='lbfgs'))
])

# Define the parameter grid
param_grid = {
    'logisticregressor__C': [0.1, 1, 10], 
    'logisticregressor__penalty': ['l2'] 
}

# Tuning with GridSearchCV, using 'accuracy' as the scoring metric for classification
grid_search = GridSearchCV(logistic_regression_pipeline, param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best parameters: ", grid_search.best_params_)
print("Best accuracy: ", grid_search.best_score_)

# Re-predict on the test set
test_predictions = grid_search.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print("Test set accuracy: ", test_accuracy)
Best parameters:  {'logisticregressor__C': 1, 'logisticregressor__penalty': 'l2'}
Best accuracy:  0.7463018708706141
Test set accuracy:  0.7506601282534893


4 Feature Effect Interpretation

Code
import statsmodels.api as sm

# Fit the logistic regression model using statsmodels' Logit
# (no intercept is added via sm.add_constant here, so the model is fit without a constant term)
model = sm.Logit(y_train, X_train).fit()

# Print the summary of the model to see the coefficients and p-values
model.summary()
Optimization terminated successfully.
         Current function value: 0.539374
         Iterations 6
Logit Regression Results
Dep. Variable: Diabetes_binary No. Observations: 49484
Model: Logit Df Residuals: 49463
Method: MLE Df Model: 20
Date: Mon, 28 Oct 2024 Pseudo R-squ.: 0.2218
Time: 12:38:59 Log-Likelihood: -26690.
converged: True LL-Null: -34300.
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
HighBP 0.8583 0.023 37.214 0.000 0.813 0.904
HighChol 0.6189 0.022 28.045 0.000 0.576 0.662
CholCheck -0.7648 0.061 -12.589 0.000 -0.884 -0.646
BMI 0.0347 0.002 22.392 0.000 0.032 0.038
Smoker -0.1245 0.022 -5.703 0.000 -0.167 -0.082
Stroke 0.1591 0.049 3.264 0.001 0.064 0.255
HeartDiseaseorAttack 0.3890 0.034 11.588 0.000 0.323 0.455
PhysActivity -0.2097 0.024 -8.583 0.000 -0.258 -0.162
Fruits -0.0784 0.023 -3.460 0.001 -0.123 -0.034
Veggies -0.1824 0.027 -6.728 0.000 -0.236 -0.129
HvyAlcoholConsump -0.8665 0.057 -15.252 0.000 -0.978 -0.755
AnyHealthcare -0.7209 0.053 -13.631 0.000 -0.825 -0.617
NoDocbcCost -0.3376 0.039 -8.760 0.000 -0.413 -0.262
GeneralHealth_Rating 0.3827 0.013 30.320 0.000 0.358 0.407
PoorGenHealth_Days -0.0103 0.001 -6.904 0.000 -0.013 -0.007
PoorPhysHealth_Days -0.0016 0.001 -1.184 0.237 -0.004 0.001
DiffWalk 0.2159 0.030 7.145 0.000 0.157 0.275
Sex 0.1762 0.022 7.965 0.000 0.133 0.220
Age 0.0721 0.004 17.197 0.000 0.064 0.080
Education -0.2620 0.011 -23.552 0.000 -0.284 -0.240
Income -0.0836 0.006 -13.961 0.000 -0.095 -0.072


In terms of factors associated with diabetes, we can look into the summary results of our logistic regression model, which is typically used when predicting a binary (two-category) target variable. The statsmodels package in Python gives easy access to logistic regression summary statistics, providing information beyond the slope coefficients themselves, such as p-values and confidence intervals. Based on these estimates, every predictor has a significant impact on diabetes status at an alpha level, or significance threshold, of 0.05. After running a few different iterations, however, I noticed slightly varying results, with CholCheck and PoorPhysHealth_Days showing an insignificant impact at times. To account for this variability, I elected to use a stricter p-value cutoff of 0.01 rather than 0.05, which warranted removing these unstable variables, whose p-values reached 0.046 and 0.03 in those runs. As for our tested interactions, all of them produced statistically significant p-values, indicating that they contribute meaningfully to the model beyond the main effects of the individual variables.
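Flagging the predictors that fail a chosen cutoff is straightforward with the fitted statsmodels results; a minimal sketch using the model object fit above:

Code
# List any predictors whose p-values exceed the stricter 0.01 cutoff
unstable = model.pvalues[model.pvalues > 0.01]
print(unstable)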

However, it is worth interpreting the slope coefficients of our interactions, to ensure we can explain and understand the underlying impact of each interaction on our target variable, diabetes. The positive coefficients for all three interaction terms, Income_BMI, Income_HighBP, and Income_HighChol, indicate that as both income and the paired risk factor increase, the combined effect on the likelihood of diabetes increases. This may seem counterintuitive given our earlier hypothesis, but several factors could explain the relationship. First, higher-income individuals might have better access to healthcare and therefore be more likely to be diagnosed with diabetes if they have high BMI, cholesterol, or blood pressure. Additionally, higher income does not necessarily equate to healthier lifestyle choices: people with higher incomes may have more opportunities to indulge in behaviors that raise BMI, cholesterol, or blood pressure, such as eating out frequently or consuming high-calorie foods. Now that we understand the significant factors that contribute to diabetes, we can move on to more complex solutions, such as Keras neural networks and stacking models, to improve on our accuracy of 75%.
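Because Logit coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to communicate; a minimal sketch using the fitted model above:

Code
# Convert log-odds coefficients to odds ratios; e.g. exp(0.8583) ≈ 2.36 for HighBP,
# meaning high blood pressure multiplies the odds of diabetes by about 2.4, all else equal
odds_ratios = np.exp(model.params)
print(odds_ratios.sort_values(ascending=False))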


5 Model 1: Keras Neural Network

Code
# Step 1: Preprocess the data

# Drop 'Diabetes_binary' due to it being our target column & drop insignificant predictors
X = diabetes_data.drop(['Diabetes_binary','PoorPhysHealth_Days', 'CholCheck'], axis=1)
y = diabetes_data['Diabetes_binary']

# Normalize features using a standard scaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)


# Ensure y_train and y_test are NumPy arrays for Keras to be able to fit our model
y2_train = np.array(y_train).astype(np.float32).ravel()  # Flatten to 1D
y2_test = np.array(y_test).astype(np.float32).ravel()

# Specify the number of features, which will be used for the input shape
number_of_features = len(X.columns)

# Step 2: Build the model
model = Sequential([
    Input(shape=(number_of_features,)),
    Dense(12, activation='relu'),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Step 3: Compile the model
model.compile(loss='binary_crossentropy', optimizer = 'adam', metrics=['accuracy'])

# Step 4: Train the model
model.fit(X_train, y2_train, epochs=20, batch_size=10, verbose=1)

# Step 5: Evaluate the model
loss, accuracy = model.evaluate(X_test, y2_test)
print('Test accuracy:', accuracy)
Epoch 1/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 268us/step - accuracy: 0.7180 - loss: 0.5521
Epoch 2/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 257us/step - accuracy: 0.7479 - loss: 0.5102
Epoch 3/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 261us/step - accuracy: 0.7484 - loss: 0.5078
Epoch 4/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 261us/step - accuracy: 0.7508 - loss: 0.5078
Epoch 5/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 262us/step - accuracy: 0.7521 - loss: 0.5053
Epoch 6/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 258us/step - accuracy: 0.7533 - loss: 0.5048
Epoch 7/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 257us/step - accuracy: 0.7554 - loss: 0.5033
Epoch 8/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 260us/step - accuracy: 0.7565 - loss: 0.5008
Epoch 9/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 259us/step - accuracy: 0.7533 - loss: 0.5027
Epoch 10/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 260us/step - accuracy: 0.7527 - loss: 0.5030
Epoch 11/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 274us/step - accuracy: 0.7547 - loss: 0.5005
Epoch 12/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 268us/step - accuracy: 0.7544 - loss: 0.5005
Epoch 13/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 268us/step - accuracy: 0.7528 - loss: 0.5051
Epoch 14/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 262us/step - accuracy: 0.7505 - loss: 0.5042
Epoch 15/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 264us/step - accuracy: 0.7564 - loss: 0.4993
Epoch 16/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 272us/step - accuracy: 0.7536 - loss: 0.5015
Epoch 17/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 266us/step - accuracy: 0.7559 - loss: 0.4988
Epoch 18/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 263us/step - accuracy: 0.7563 - loss: 0.4998
Epoch 19/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 262us/step - accuracy: 0.7530 - loss: 0.5044
Epoch 20/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 1s 261us/step - accuracy: 0.7551 - loss: 0.5013
663/663 ━━━━━━━━━━━━━━━━━━━━ 0s 234us/step - accuracy: 0.7552 - loss: 0.5011
Test accuracy: 0.7523576021194458


6 Model 2: Keras Neural Network - Feature Selection & Tuning

Code
# Specify the number of features, which will be used for the input shape
number_of_features = len(X.columns)

# Step 2: Build the model
model = Sequential([
    Input(shape=(number_of_features,)),
    Dense(64, activation='relu'),      # Increased neurons in the first hidden layer
    Dropout(0.3),                      # Dropout applies regularization
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(16, activation='relu'),      # Added more dense layers
    Dropout(0.3),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Step 3: Compile the model
model.compile(loss='binary_crossentropy', optimizer = 'adam', metrics=['accuracy'])

# Step 4: Train the model
model.fit(X_train, y2_train, epochs=20, batch_size=10, verbose=1)

# Step 5: Evaluate the model
loss, accuracy = model.evaluate(X_test, y2_test)
print('Test accuracy:', accuracy)
Epoch 1/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 371us/step - accuracy: 0.7087 - loss: 0.5708
Epoch 2/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 343us/step - accuracy: 0.7474 - loss: 0.5227
Epoch 3/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 346us/step - accuracy: 0.7469 - loss: 0.5198
Epoch 4/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 344us/step - accuracy: 0.7431 - loss: 0.5169
Epoch 5/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 366us/step - accuracy: 0.7423 - loss: 0.5207
Epoch 6/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 346us/step - accuracy: 0.7473 - loss: 0.5121
Epoch 7/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 347us/step - accuracy: 0.7531 - loss: 0.5107
Epoch 8/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 349us/step - accuracy: 0.7468 - loss: 0.5139
Epoch 9/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 352us/step - accuracy: 0.7487 - loss: 0.5143
Epoch 10/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 360us/step - accuracy: 0.7537 - loss: 0.5099
Epoch 11/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 342us/step - accuracy: 0.7490 - loss: 0.5127
Epoch 12/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 347us/step - accuracy: 0.7489 - loss: 0.5128
Epoch 13/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 342us/step - accuracy: 0.7490 - loss: 0.5131
Epoch 14/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 344us/step - accuracy: 0.7533 - loss: 0.5072
Epoch 15/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 345us/step - accuracy: 0.7526 - loss: 0.5086
Epoch 16/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 345us/step - accuracy: 0.7488 - loss: 0.5105
Epoch 17/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 343us/step - accuracy: 0.7538 - loss: 0.5069
Epoch 18/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 345us/step - accuracy: 0.7567 - loss: 0.5034
Epoch 19/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 347us/step - accuracy: 0.7492 - loss: 0.5117
Epoch 20/20
4949/4949 ━━━━━━━━━━━━━━━━━━━━ 2s 353us/step - accuracy: 0.7477 - loss: 0.5097
663/663 ━━━━━━━━━━━━━━━━━━━━ 0s 242us/step - accuracy: 0.7539 - loss: 0.5106
Test accuracy: 0.7512730956077576


Following the removal of the insignificant variables found during the exploratory logistic regression, and the addition of more layers, parameters, and dropout regularization, our accuracy rate still showed almost no change from 75%, suggesting we have reached a ceiling for this feature set. Even so, an accuracy above 70% on practical, real-world data suggests a highly successful model, with many potential uses in hospital, personal, and even insurance contexts.
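One further way to test whether the model has truly plateaued, not tried in the runs above, is to hold out a validation split and stop training once validation loss stops improving; a minimal sketch using Keras' EarlyStopping callback:

Code
from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss fails to improve for 3 consecutive epochs,
# rolling back to the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

model.fit(X_train, y2_train, epochs=50, batch_size=10,
          validation_split=0.2, callbacks=[early_stop], verbose=1)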


7 Model 3 - Stacked Model

  1. Base Models - XGBoost, AdaBoost, MLPClassifier, Random Forest
  2. Final Meta-Model - Logistic Regression

In this ensemble approach, each of the base models (XGBoost, AdaBoost, MLPClassifier, and Random Forest) contributes an initial set of predictions. These models were selected for their diverse strengths in handling different types of data and capturing different patterns. Their predictions are then used as inputs to the final meta-model, in this case logistic regression, which synthesizes them into a single prediction, leveraging the combined knowledge of the base models to improve overall performance and robustness. After fitting, the meta-model's coefficients reveal how much weight each base learner receives, which we inspect after the results below.


Code
# Drop 'Diabetes_binary' due to it being our target column & drop insignificant predictors
X = diabetes_data.drop(['Diabetes_binary', 'PoorPhysHealth_Days', 'CholCheck'], axis=1)
y = diabetes_data['Diabetes_binary']

# Define column transformer to be used for preprocessing
ct_all = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
     make_column_selector(dtype_include=object)),
    ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))
  ],
  remainder="passthrough"
).set_output(transform="pandas")


# Apply the column transformer to the entire dataset
X_preprocessed = ct_all.fit_transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.3, random_state=42)

# Define base models with tuned parameters from our usual best performers
# (X was already preprocessed by ct_all above, so no transformer is repeated inside the pipelines)
base_models = [
    ('gradient_boosting', Pipeline([
        ('classifier', GradientBoostingClassifier(learning_rate=0.1, max_depth=4, n_estimators=300))
    ])),
    ('xgboost', Pipeline([
        ('classifier', XGBClassifier(learning_rate=0.3, max_depth=4, n_estimators=300))
    ])),
    ('mlp_classifier', Pipeline([
        ('classifier', MLPClassifier(activation='relu', hidden_layer_sizes=(100,),
                                     learning_rate='constant', solver='adam', max_iter=1000,
                                     early_stopping=True))
    ])),
    ('ada_boost', Pipeline([
        ('classifier', AdaBoostClassifier(algorithm='SAMME', n_estimators=300,
                                          learning_rate=0.1))
    ])),
    ('random_forest', Pipeline([
        ('classifier', RandomForestClassifier(max_depth=None, min_samples_leaf=1,
                                              n_estimators=150))
    ]))
]

# Define the final meta-estimator, typically linear or logistic; for classification we use logistic regression
final_estimator = LogisticRegression()

# Create the stacking ensemble
stacking_classifier = StackingClassifier(estimators=base_models, final_estimator=final_estimator, cv=5)

# Fit the stacking model on train data
stacking_classifier.fit(X_train, y_train)

# Predict/evaluate the model on Test data
stacking_predictions = stacking_classifier.predict(X_test)
stacking_accuracy = accuracy_score(y_test, stacking_predictions)

# Stacking model performance including final feature set
print("Stacking Model Test Accuracy:", stacking_accuracy)
Stacking Model Test Accuracy: 0.7531591852131271
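Since the meta-model is a logistic regression over the base models' out-of-fold predicted probabilities, its fitted coefficients show how heavily the stack leans on each base learner; a minimal sketch of that inspection, using scikit-learn's final_estimator_ attribute:

Code
# Inspect the meta-model's weight on each base model's predictions
for name, coef in zip([name for name, _ in base_models],
                      stacking_classifier.final_estimator_.coef_[0]):
    print(f"{name}: {coef:.3f}")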
Code
import os 

# Generate the confusion matrix, using labels from the classifier
labels = stacking_classifier.classes_ 
cm = confusion_matrix(y_test, stacking_predictions, labels=labels)

# Ensure the output directory exists
output_dir = "new_files/figure-html"
os.makedirs(output_dir, exist_ok=True)

# Plotting the confusion matrix using Seaborn
plt.figure(figsize=(10, 7))
sn.heatmap(cm, annot=True, cmap='Blues', cbar=True, fmt='d',
            xticklabels=labels, yticklabels=labels)
plt.title('Confusion Matrix for Stacking Classifier')
plt.xlabel('Predicted Labels')
plt.ylabel('Actual Labels')

# Save the plot to the correct folder
plt.savefig(f"{output_dir}/cell-9-output-1.png", bbox_inches='tight')
plt.show()

8 Discussion of Results

It's no surprise that the stacking model achieved the highest test accuracy, 75.33%, while the Keras neural network and logistic regression models came in closely behind at 75.30% and 75%, respectively. However, this was far closer in performance than we typically see between methods, and in this case the final meta-model was unable to learn significantly more than even a simple logistic regression. These consistent results were expected given the clean and well-preprocessed data, which allowed us to effectively capture the underlying patterns related to diabetes prediction. When examining the factors associated with diabetes, the logistic regression model revealed some interesting insights. Factors positively associated with diabetes include high blood pressure, high cholesterol, higher BMI, stroke history, heart disease, a poor general health rating, difficulty walking, being male, and older age, almost all of which are unsurprising aside from gender. On the flip side, variables negatively associated with diabetes are physical activity, fruit and vegetable consumption, access to healthcare, education, and higher income. The remaining factors in the model, although statistically significant, likely have little practical effect on diabetes status, whereas the features discussed above also carried large z-scores, meaning their effects are substantial. It was surprising to find that being a smoker showed a small negative coefficient rather than the positive effect one might expect, and that poor physical health days showed no significant effect, which could be random or confounded with other factors.

Once again, the interaction terms (Income_BMI, Income_HighBP, and Income_HighChol) showed a positive association with diabetes, suggesting that as both income and the paired risk factor increase, the likelihood of diabetes also rises. This might seem counterintuitive at first, but it makes sense when you consider that higher-income individuals may have better access to healthcare, leading to more frequent diabetes diagnoses when they have high BMI, cholesterol, or blood pressure. Additionally, people with higher incomes might indulge more in eating out and consuming high-calorie foods, contributing to higher BMI, cholesterol, or blood pressure, and thus increasing their risk of diabetes. Finally, inspecting the final confusion matrix of our stacking model's predictions lets us visualize our success: the model correctly identified 8,405 diabetic cases and 7,572 non-diabetic cases. In terms of the cost of misclassification, our model aims to minimize false negatives, cases where we predict someone to be clear of diabetes risk when they already have the disease, or will eventually develop it, and require treatment. Here we have 2,202 false negatives, a large number, but in context it is the smallest of the four prediction groups (TP, FP, TN, FN), which is a good sign. We incorrectly predicted 3,029 people to have diabetes when they did not; however, this likely means those individuals share traits with diabetics, indicating that they should be monitored and encouraged to adopt the controllable protective factors identified above.
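To put those counts on a standard footing, the per-class rates can be computed directly from the stored predictions; a minimal sketch using scikit-learn's classification_report:

Code
from sklearn.metrics import classification_report

# Recall for class 1 is the sensitivity: the share of true diabetics the model catches,
# roughly 8405 / (8405 + 2202) ≈ 0.79 given the confusion matrix above
print(classification_report(y_test, stacking_predictions, digits=3))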

8.1 Sources

Diabetes Risk Factors - CDC