Utilizing Boosting to Identify Heart Disease
Introduction
The primary goal of this machine learning project is to identify the presence of heart disease in patients using boosting techniques. Boosting is a powerful ensemble method that combines multiple weak learners to create a strong predictive model. By leveraging this approach, we aim to enhance the accuracy and robustness of our heart disease prediction model.
The importance of an analysis like this cannot be overstated. Heart disease remains one of the leading causes of mortality worldwide, and early detection is crucial for effective treatment and prevention. Developing a reliable and accurate predictive model can have significant implications for public health by enabling early detection and intervention, efficiently allocating medical resources, creating personalized treatment plans, advancing medical research, and reducing healthcare costs.
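To make the idea of combining weak learners concrete, here is a minimal sketch on synthetic data (not the heart disease data). It compares a single depth-1 decision tree with an AdaBoost ensemble of such trees. The variable names and the synthetic data settings are assumptions chosen only for illustration.

```python
# Minimal sketch on synthetic data; not part of the heart disease analysis.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem: roughly 90% negatives, 10% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# A single depth-1 tree (a "stump") is a typical weak learner.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(Xd_train, yd_train)

# AdaBoost fits a sequence of weak learners (depth-1 trees by default),
# up-weighting the training points that earlier learners misclassified.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0).fit(Xd_train, yd_train)

stump_acc = stump.score(Xd_test, yd_test)
boosted_acc = boosted.score(Xd_test, yd_test)
print(f"single stump accuracy:   {stump_acc:.3f}")
print(f"boosted stumps accuracy: {boosted_acc:.3f}")
```

On most problems the boosted ensemble scores at least as well as the single stump, although the size of the gap depends on the data.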
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from xgboost import XGBClassifier
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import Ridge
from sklearn.metrics import *
from sklearn.preprocessing import LabelEncoder
from imblearn.pipeline import Pipeline as ImblearnPipeline
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
import warnings
Read-In Data
# View a head of the data to find out about column naming issues, and missing values
heart_2020 = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /archive (4) copy/2020/heart_2020_cleaned.csv")
heart_2022 = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /archive (4) copy/2022/heart_2022_no_nans.csv")
# We can see that 2020 had a well-prepared feature set, while 2022 includes some additional variables
heart_2020.head(10)
 | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | No | 16.60 | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes |
1 | No | 20.34 | No | No | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No |
2 | No | 26.58 | Yes | No | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No |
3 | No | 24.21 | No | No | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes |
4 | No | 23.71 | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No |
5 | Yes | 28.87 | Yes | No | No | 6.0 | 0.0 | Yes | Female | 75-79 | Black | No | No | Fair | 12.0 | No | No | No |
6 | No | 21.63 | No | No | No | 15.0 | 0.0 | No | Female | 70-74 | White | No | Yes | Fair | 4.0 | Yes | No | Yes |
7 | No | 31.64 | Yes | No | No | 5.0 | 0.0 | Yes | Female | 80 or older | White | Yes | No | Good | 9.0 | Yes | No | No |
8 | No | 26.45 | No | No | No | 0.0 | 0.0 | No | Female | 80 or older | White | No, borderline diabetes | No | Fair | 5.0 | No | Yes | No |
9 | No | 40.69 | No | No | No | 0.0 | 0.0 | Yes | Male | 65-69 | White | No | Yes | Good | 10.0 | No | No | No |
Data Preparation:
- Cleaning 2020 data
We began the data preparation process by thoroughly cleaning the 2020 data. This step was essential to ensure the accuracy and reliability of our machine learning model.
- Renaming Variables within the 2022 set
Next, we renamed variables within the 2022 dataset to maintain consistency across different years. This renaming helps in aligning the datasets and simplifies further analysis.
- Drop Additional Columns from 2022 to ensure matching feature sets
To ensure matching feature sets between the two datasets, we dropped additional columns from the 2022 data. This step was crucial for creating a unified dataset for our model.
- Feature Selection
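The alignment steps listed above can be sketched on two toy tables. This is only an illustration: `df_2020`, `df_2022`, and `rename_map` are hypothetical stand-ins, and the set difference is simply one convenient way to see which columns still differ after renaming.

```python
import pandas as pd

# Toy stand-ins for the two yearly tables; the rows are made up.
df_2020 = pd.DataFrame({"HeartDisease": ["No"], "SleepTime": [7.0], "Stroke": ["No"]})
df_2022 = pd.DataFrame({"HadHeartAttack": ["No"], "SleepHours": [8.0],
                        "HadStroke": ["No"], "State": ["Alabama"]})

# Hypothetical rename map from 2022 names to their 2020 equivalents.
rename_map = {"SleepHours": "SleepTime", "HadStroke": "Stroke"}
df_2022 = df_2022.rename(columns=rename_map)

# Columns present in one year but not the other after renaming.
only_2022 = sorted(set(df_2022.columns) - set(df_2020.columns))
only_2020 = sorted(set(df_2020.columns) - set(df_2022.columns))
print("only in 2022:", only_2022)
print("only in 2020:", only_2020)
```

Anything left in `only_2022` besides the target column is a candidate to drop so that the two feature sets match.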
Cleaning Heart 2020 Data
# For the diabetes column, there are cases with extra information beyond the "Yes" and "No", and we want consistent binary selections
diabetes_value_counts = heart_2020['Diabetic'].value_counts()
diabetes_value_counts
# Define a dictionary for the changes being made
changes_mapped = {
    'No, borderline diabetes': 'No',
    'Yes (during pregnancy)': 'Yes'
}
# Apply the replacements to the 'Diabetic' column
heart_2020['Diabetic'] = heart_2020['Diabetic'].replace(changes_mapped)
Column Renaming
# Before working with these dataframes, I wanted to ensure the columns and their names matched
heart_2022_copy = heart_2022.drop(['State','LastCheckupTime',"RemovedTeeth","TetanusLast10Tdap","LastCheckupTime","HeightInMeters","WeightInKilograms","HIVTesting","FluVaxLast12",
                                   "PneumoVaxEver","TetanusLast10Tdap","HighRiskLastYear","CovidPos","HadAngina","HadCOPD",
                                   "HadDepressiveDisorder","DifficultyDressingBathing","DifficultyErrands","BlindOrVisionDifficulty",
                                   "DifficultyConcentrating","DeafOrHardOfHearing","ECigaretteUsage","HadArthritis","ChestScan"], axis = 1)
heart_2022_copy
# Rename columns to be the same as heart 2020 dataset, for usability and easier understanding of results
heart_2022_copy.rename(columns={'GeneralHealth': 'GenHealth',
                                'PhysicalHealthDays': 'PhysicalHealth',
                                'MentalHealthDays': 'MentalHealth',
                                'SleepHours': 'SleepTime',
                                'PhysicalActivities':'PhysicalActivity',
                                'HadStroke':'Stroke',
                                'HadAsthma':'Asthma',
                                'HadSkinCancer':'SkinCancer',
                                'HadKidneyDisease': 'KidneyDisease',
                                'HadDiabetes':'Diabetic',
                                'DifficultyWalking':'DiffWalking',
                                'SmokerStatus':'Smoking',
                                'AlcoholDrinkers':'AlcoholDrinking'}, inplace=True)
Variable Cleaning
In terms of variable recoding:
- Re-classifying the Race Variable: We simplified the race variable by collapsing each detailed race/ethnicity label into a single short category (for example, “White only, Non-Hispanic” becomes “White”), as I aimed to minimize its impact and promote a more equitable analysis.
- Evaluating the Smoking Status: Initially, we considered collapsing the smoking status into a binary “Yes” or “No.” However, after inspecting the data, we decided that the more detailed smoking status provided more valuable information to our model, so we left it unchanged.
# Build a function that checks whether certain race labels appear in the column,
# and removes any additional information that is not necessarily needed for our analysis
def classify_race(df):
    df['Race'] = df['RaceEthnicityCategory'].apply(
        lambda x: 'White' if 'White only' in x
        else ('Black' if 'Black only' in x
        else ('Multiracial' if 'Multiracial' in x
        else ('Other' if 'Other race' in x else x)))
    )
    return df
# Apply the function to our Heart 2022 dataframe
heart_2022_copy = classify_race(heart_2022_copy)
# Inspect the value counts of each race to ensure our function correctly applied the classification
original_race_counts = heart_2022_copy['RaceEthnicityCategory'].value_counts()
print(original_race_counts)
# Viewing the new and original race counts, we can see that they are the same, meaning the function worked
new_race_counts = heart_2022_copy['Race'].value_counts()
print('\n',new_race_counts)
# Drop the RaceEthnicityCategory column from the df as its data is now in the Race column
heart_2022_final = heart_2022_copy.drop(["RaceEthnicityCategory"], axis = 1)
White only, Non-Hispanic 186336
Hispanic 22570
Black only, Non-Hispanic 19330
Other race only, Non-Hispanic 12205
Multiracial, Non-Hispanic 5581
Name: RaceEthnicityCategory, dtype: int64
White 186336
Hispanic 22570
Black 19330
Other 12205
Multiracial 5581
Name: Race, dtype: int64
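As an alternative to the nested conditional, the same relabeling can be sketched with vectorized string methods. This is a sketch that assumes the labels follow the five patterns printed above; `labels` and `simplified` are hypothetical names, and a label outside those patterns would be handled differently than by `classify_race`.

```python
import pandas as pd

# The five category labels that appear in the value counts above.
labels = pd.Series([
    "White only, Non-Hispanic", "Hispanic", "Black only, Non-Hispanic",
    "Other race only, Non-Hispanic", "Multiracial, Non-Hispanic",
])

simplified = (
    labels.str.split(",").str[0]                    # keep the part before the comma
          .str.replace(" only", "", regex=False)    # "White only" -> "White"
          .replace({"Other race": "Other"})         # match the short label used above
)
print(simplified.tolist())
```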
Training the Model: Heart Disease Data 2020
Here, “training” refers to the initial feature selection process: we performed model validation (training and testing) on two models using the 2020 data, XGBoost and an AdaBoosted decision tree. Initially, I considered training solely on the 2020 data and using 2022 as a test set. However, I chose to first investigate each 2020 model’s test error before moving on to completely unseen data from 2022. This approach helps keep the model relevant to current data: if certain factors are no longer significant in 2022, I can account for this within the 2022 model and note which factors have lost their importance.
heart_2020.head(5)
 | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | No | 16.60 | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes |
1 | No | 20.34 | No | No | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No |
2 | No | 26.58 | Yes | No | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No |
3 | No | 24.21 | No | No | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes |
4 | No | 23.71 | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No |
# Create new variables for the interactions BMI * sleeping hours and PhysicalHealth days * MentalHealth days
heart_2020['BMI_SleepHrs'] = heart_2020['BMI'] * heart_2020['SleepTime']
heart_2020['Days_Off'] = heart_2020['PhysicalHealth'] * heart_2020['MentalHealth']
heart_2020['HeartDisease'] = heart_2020['HeartDisease'].astype(str)
# Ensure proper datatypes before modeling, Numerical Variables = Physical Health, Mental Health, SleepTime, BMI
print(heart_2020.dtypes)
# All categorical variables are objects, and numerical variables are floats
HeartDisease object
BMI float64
Smoking object
AlcoholDrinking object
Stroke object
PhysicalHealth float64
MentalHealth float64
DiffWalking object
Sex object
AgeCategory object
Race object
Diabetic object
PhysicalActivity object
GenHealth object
SleepTime float64
Asthma object
KidneyDisease object
SkinCancer object
BMI_SleepHrs float64
Days_Off float64
dtype: object
Apply an XGBoosting Tree Model
# Define X and y sets, using train_test_split to save 30% for test and use 70% of the data to fit our model
# Initialize label encoder to be able to encode the HeartDisease column
label_encoder = LabelEncoder()
# Apply label encoder to the 'HeartDisease' column to ensure XGBoost handles it
heart_2020['HeartDisease'] = label_encoder.fit_transform(heart_2020['HeartDisease'])
# Define our feature set
X = heart_2020.drop(["HeartDisease","Race"], axis = 1)
y = heart_2020["HeartDisease"]
# Use stratified splitting to ensure y has a representative % of the target class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify = y)
# Model 1 - Dummify everything needed and scale numerical columns, All Predictors
ct_all = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output = False, handle_unknown='ignore'),
         make_column_selector(dtype_include=object)),
        ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number)) #keep int64 as is
    ],
    remainder = "passthrough"
).set_output(transform = "pandas")
# Use target class distribution to determine values to tune for pos_weight (which takes into account severe class imbalances)
target_class_distribution = heart_2020['HeartDisease'].value_counts() # 0 - 292422
                                                                      # 1 - 27373
# Define a grid of possible values that are close to 292,422 / 27,373 = 10.68
pos_weight_values = [10.6,10.7]
# Setup pipeline #1 for XGBoost
xgb_pipeline = Pipeline([
    ("preprocessor", ct_all),
    ("classifier", XGBClassifier(use_label_encoder=False, eval_metric='logloss')) # Parameters needed to suppress warnings
])
# Parameter grid for XGBoost with different parameters being tuned
param_grid_xgb = {
    'classifier__n_estimators': [100], # Number of boosted trees, reduced from 300 to 100 to run
    'classifier__learning_rate': [0.1, 0.5], # Boosting learning rate (xgb’s “eta”)
    'classifier__max_depth': [3,4], # Maximum depth of a tree (used 3,4 to have code run)
    'classifier__scale_pos_weight': pos_weight_values
}
# Setup GridSearchCV for XGBoost; n_jobs=-1 uses all available CPU cores for the grid search
grid_search_xgb = GridSearchCV(xgb_pipeline, param_grid_xgb, cv=5, scoring='precision', n_jobs=-1)
# Fitting our model on training data
grid_search_xgb.fit(X_train, y_train)
# Predict on the test set
test_xgb_predictions = grid_search_xgb.predict(X_test)
# Use precision_score to compute the share of predicted positives (1's) that are actual positives
test_xgb_precision = precision_score(y_test, test_xgb_predictions)
print("Test XGBoost Precision: ", test_xgb_precision)
Test XGBoost Precision: 0.2189363063760989
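The `scale_pos_weight` grid above was centered on the ratio of negative to positive cases. A small helper can compute that ratio from a label vector instead of typing it by hand. This is a sketch: `scale_pos_weight_from_labels` is a hypothetical helper name, and the labels are rebuilt here from the class counts reported above rather than read from the data.

```python
import numpy as np

def scale_pos_weight_from_labels(y):
    """Return (# negatives) / (# positives) for a 0/1 label array."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive examples in y")
    return n_neg / n_pos

# Rebuild a label vector with the class counts reported above (0: 292422, 1: 27373).
y_counts = np.concatenate([np.zeros(292422, dtype=int), np.ones(27373, dtype=int)])
ratio = scale_pos_weight_from_labels(y_counts)
print(round(ratio, 2))  # 10.68
```

In practice the helper could be applied to `y_train` only. With a stratified split the ratio is nearly identical, so this mainly keeps the test labels out of the tuning decisions.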
Apply an AdaBoost Model (Decision Tree as Base Model)
# Define feature set
X = heart_2020.drop(["HeartDisease"], axis=1)
y = heart_2020["HeartDisease"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
# Model 1 - Dummify everything needed and scale numerical columns, All Predictors
ct_all = ColumnTransformer(
    [("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
      make_column_selector(dtype_include=object)),
     ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))],
    remainder="passthrough"
)
# Setup pipeline for AdaBoost with decision tree as the base estimator
ada_pipeline = ImblearnPipeline([
    ("preprocessor", ct_all),
    ("smote", SMOTE(random_state=42)),
    ("classifier", AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), random_state=2, algorithm='SAMME'))
])
# Parameter grid for AdaBoost with different parameters being tuned
param_grid_ada = {
    'classifier__n_estimators': [100], # Number of boosted trees, set to 100 to run
    'classifier__learning_rate': [0.1] # Boosting learning rate (set to 0.1 to run)
}
# Setup GridSearchCV for AdaBoost
grid_search_ada = GridSearchCV(ada_pipeline, param_grid_ada, cv=5, scoring='precision', n_jobs=-1)
# Fitting our model on training data
grid_search_ada.fit(X_train, y_train)
# Predict on the test set
test_ada_predictions = grid_search_ada.predict(X_test)
# Use precision_score to compute the share of predicted positives (coded as 1's) that are actual positives
test_ada_precision = precision_score(y_test, test_ada_predictions, pos_label = 1)
print("Test AdaBoost Precision: ", test_ada_precision)
Test AdaBoost Precision: 0.25
Testing the Best Model: Updated Heart Attack Data 2022
heart_2022_final.head(5)
 | Sex | GenHealth | PhysicalHealth | MentalHealth | PhysicalActivity | SleepTime | HadHeartAttack | Stroke | Asthma | SkinCancer | KidneyDisease | Diabetic | DiffWalking | Smoking | AgeCategory | BMI | AlcoholDrinking | Race |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Female | Very good | 4.0 | 0.0 | Yes | 9.0 | No | No | No | No | No | No | No | Former smoker | Age 65 to 69 | 27.99 | No | White |
1 | Male | Very good | 0.0 | 0.0 | Yes | 6.0 | No | No | No | No | No | Yes | No | Former smoker | Age 70 to 74 | 30.13 | No | White |
2 | Male | Very good | 0.0 | 0.0 | No | 8.0 | No | No | No | No | No | No | Yes | Former smoker | Age 75 to 79 | 31.66 | Yes | White |
3 | Female | Fair | 5.0 | 0.0 | Yes | 9.0 | No | No | No | Yes | No | No | Yes | Never smoked | Age 80 or older | 31.32 | No | White |
4 | Female | Good | 3.0 | 15.0 | Yes | 5.0 | No | No | No | No | No | No | No | Never smoked | Age 80 or older | 33.07 | No | White |
Now that we have a well-defined feature set from 2020 that performs somewhat well in predicting the imbalanced target variable, “Heart Disease”, we will re-evaluate these features, or “best model”, on an updated slice of heart disease data from 2022. This will allow us to test whether the predictor variables that contributed to a rise or fall in heart disease risk are still having an effect 2 years later, following the major disruption from Covid-19. To keep the data consistent and allow for an ad-hoc causal analysis to be conducted, I chose to include only the same 19 predictors from our best boosting model for 2020, so that any changes are more likely due to societal effects than to the exclusion or inclusion of a variable.
# Create new variables for the interactions BMI * sleeping hours and PhysicalHealth days * MentalHealth days
heart_2022_final['BMI_SleepHrs'] = heart_2022_final['BMI'] * heart_2022_final['SleepTime']
heart_2022_final['Days_Off'] = heart_2022_final['PhysicalHealth'] * heart_2022_final['MentalHealth']
heart_2022_final['HadHeartAttack'] = heart_2022_final['HadHeartAttack'].astype(str)
# Ensure proper datatypes before modeling, Numerical Variables = Physical Health, Mental Health, SleepTime, BMI
print(heart_2022_final.dtypes)
Sex object
GenHealth object
PhysicalHealth float64
MentalHealth float64
PhysicalActivity object
SleepTime float64
HadHeartAttack object
Stroke object
Asthma object
SkinCancer object
KidneyDisease object
Diabetic object
DiffWalking object
Smoking object
AgeCategory object
BMI float64
AlcoholDrinking object
Race object
BMI_SleepHrs float64
Days_Off float64
dtype: object
Final Best Model - ADABoosted Decision Tree
# Initialize label encoder to be able to encode the HadHeartAttack column
label_encoder = LabelEncoder()
# Apply label encoder to the 'HadHeartAttack' column so the target is numeric (0/1)
heart_2022_final['HadHeartAttack'] = label_encoder.fit_transform(heart_2022_final['HadHeartAttack'])
# Define the new X and y sets, denoted as X_2 and y_2 to avoid confusion
X_2 = heart_2022_final.drop(["HadHeartAttack","Race"], axis=1)
y_2 = heart_2022_final["HadHeartAttack"]
# Use stratified splitting to ensure y has a representative % of the target class
X2_train, X2_test, y2_train, y2_test = train_test_split(X_2, y_2, test_size=0.3, stratify=y_2, random_state = 2)
# Model 1 - Dummify everything needed and scale numerical columns, All Predictors
ct_all = ColumnTransformer(
    [("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
      make_column_selector(dtype_include=object)),
     ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))],
    remainder="passthrough"
)
# Setup pipeline for AdaBoost with decision tree as the base estimator
ada_pipeline = ImblearnPipeline([
    ("preprocessor", ct_all),
    ("smote", SMOTE(random_state=42)),
    ("classifier", AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), random_state=2, algorithm='SAMME'))
])
# Parameter grid for AdaBoost with different parameters being tuned
param_grid_ada = {
    'classifier__n_estimators': [100], # Number of boosted trees, set to 100 to run
    'classifier__learning_rate': [0.1] # Boosting learning rate (set to 0.1 to run)
}
# Setup GridSearchCV for AdaBoost
grid_search_ada = GridSearchCV(ada_pipeline, param_grid_ada, cv=5, scoring='precision', n_jobs=-1)
# Fitting our model on training data
grid_search_ada.fit(X2_train, y2_train)
# Predict on the test set
test_ada_predictions = grid_search_ada.predict(X2_test)
# Use precision_score to compute the share of predicted positives (coded as 1's) that are actual positives
test_ada_precision = precision_score(y2_test, test_ada_predictions, pos_label = 1)
print("Test AdaBoost 2022 Precision: ", test_ada_precision)
Test AdaBoost 2022 Precision: 0.16891402058866173
Metric Selection: Precision
For this context, the performance metric could be chosen from a range of options, from accuracy to F1-score to more granular options such as precision. With a severely imbalanced data set, accuracy and even F1-score are greatly affected. Accuracy in particular could show great results for a model that simply predicts every patient to have no heart disease: on the 2020 data such a model would be correct for roughly 91% of records (292,422 of 319,795) while identifying no one at risk. Precision, in this case, is the share of patients predicted “Yes” who in fact had heart disease. With real-life data predicting an event that is unlikely to happen but crucial to catch when it does occur, we wish to maximize precision.
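The “predict No for everyone” baseline can be checked directly. The sketch below rebuilds a label vector from the 2020 class counts reported earlier (292,422 without and 27,373 with heart disease) and scores an all-“No” prediction; `y_true` and `y_all_no` are hypothetical names used only here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Labels rebuilt from the 2020 class counts: 0 = no heart disease, 1 = heart disease.
y_true = np.concatenate([np.zeros(292422, dtype=int), np.ones(27373, dtype=int)])
# A "model" that predicts "No" (0) for every patient.
y_all_no = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_no)
# zero_division=0 returns 0 instead of warning when nothing is predicted positive.
baseline_precision = precision_score(y_true, y_all_no, zero_division=0)
baseline_recall = recall_score(y_true, y_all_no, zero_division=0)

print(f"accuracy:  {baseline_accuracy:.3f}")   # 0.914
print(f"precision: {baseline_precision:.3f}")  # 0.000
print(f"recall:    {baseline_recall:.3f}")     # 0.000
```

The high accuracy comes entirely from the majority class, which is why accuracy alone says little about how well at-risk patients are identified.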
A test precision of 25% means that only about 1 in 4 of the people the model flags as at risk actually has heart disease; however, we are willing to over-classify people as being at risk for heart disease when they may not actually have it. The other side of this error would be to misclassify patients as healthy instead of flagging them to come in for a consultation. The potential consequences of that error are far more serious than the scare of being checked for heart risk while healthy, which only leads to wasted time and potentially minor visit costs. If someone is misclassified by our model as healthy when they are not, this could result not only in a death but also in potential legal complications. (Missed cases of this kind are counted by recall rather than precision, so recall is worth tracking alongside it.) With these trade-offs in mind, I selected precision as the final validation metric.
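The arithmetic behind a 25% precision can be made explicit with a small worked example. The confusion-matrix counts below are hypothetical, chosen only so that precision comes out to 25%; they are not results from the models above.

```python
# Hypothetical confusion-matrix counts, chosen only to illustrate the arithmetic.
tp = 50    # flagged, and truly has heart disease
fp = 150   # flagged, but healthy (the over-classification we accept)
fn = 30    # not flagged, but has heart disease (the costly miss)
tn = 770   # not flagged, and healthy

precision = tp / (tp + fp)  # of the patients flagged, the share who truly have heart disease
recall = tp / (tp + fn)     # of the patients with heart disease, the share who were flagged

print(f"precision = {precision:.3f}")  # 0.250
print(f"recall    = {recall:.3f}")     # 0.625
```

In this made-up case, 150 of the 200 flagged patients are false alarms, while the 30 missed patients show up only in recall.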
Heart Disease Results
In both 2020 and 2022, we were able to fit a model capable of somewhat predicting heart disease, which is extremely difficult to detect from data because it is such a rare event, with a final precision of 16.89% in 2022. On the 2020 data, extreme gradient boosting resulted in roughly 22% precision, while AdaBoost with a decision tree base model gave the best precision at 25%, about 3 percentage points higher. Many models would simply predict “No” for all observations, making them useless in practice. While it is difficult to determine the exact effect of each variable on heart disease, it is useful to have a set of variables or factors that are known to affect heart health. Common sense suggests that those who smoke are more likely to have heart complications, and those who participate in physical activities have a lower chance. There is a large effect due to Age Category, with older individuals typically having increased risk, and there is a relationship based on the Sex of the individual, although its direction is unknown.
In terms of variables that had minimal predictive impact in both 2020 and 2022, removing “Race” resulted in an improvement in our model’s predictions. This indicates that Race is not a major contributor to a person’s heart health, which could have been expected, as there shouldn’t be biological differences in heart health but rather other demographic factors that contribute to a person’s health. On the other hand, when removing predictors like Asthma, Diabetic, and Physical Activity, the predictive performance of our model decreased significantly, meaning that these factors have a profound impact on heart disease, with a positive relationship for Asthma/Diabetes and a negative one for those participating in Physical Activity. However, it is worth noting that our predictive ability decreases from 2020 to 2022, from 25% to around 16.9%, indicating that there may be some new factors contributing to having a heart attack that were not included in 2020, like having Covid-19. As a next step, we could use these remaining variables to train a model, in hopes of discovering which of these factors is responsible for this change.
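The remove-one-predictor comparisons described above can be sketched as a small drop-column loop. This is an illustration on synthetic data, not the procedure used for the results reported here. The column names `signal` and `noise`, the helper `cv_precision`, and the use of `GradientBoostingClassifier` are all assumptions made for the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 600
# Synthetic stand-in data: "signal" drives the label, "noise" does not.
demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
target = (demo["signal"] + 0.3 * rng.normal(size=n) > 1.0).astype(int)

def cv_precision(features):
    """Mean 3-fold cross-validated precision for a small boosted model."""
    model = GradientBoostingClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, features, target, cv=3, scoring="precision").mean()

full_score = cv_precision(demo)
# Change in mean CV precision when each column is removed in turn.
change = {col: cv_precision(demo.drop(columns=col)) - full_score for col in demo.columns}
for col, delta in change.items():
    print(f"dropping {col!r} changes mean CV precision by {delta:+.3f}")
```

A large negative change suggests the model relies on that column, while a change near zero (or positive) suggests it adds little, which mirrors the reasoning applied to Race versus Asthma, Diabetic, and Physical Activity.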