Utilizing Boosting to Identify Heart Disease

Author

Riley Svensson

Introduction

The primary goal of this machine learning project is to identify the presence of heart disease in patients using boosting techniques. Boosting is a powerful ensemble method that combines multiple weak learners to create a strong predictive model. By leveraging this approach, we aim to enhance the accuracy and robustness of our heart disease prediction model.
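As a minimal illustration of this idea, the sketch below (on synthetic toy data, not the project dataset) compares a single depth-1 decision "stump" against an AdaBoost ensemble of 100 such stumps, the same weak-learner setup used later in this analysis.

# Toy sketch of boosting: many weak depth-1 stumps combined into a stronger classifier
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)
boosted = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                             n_estimators=100).fit(X_tr, y_tr)

print("Single stump accuracy:", stump.score(X_te, y_te))
print("Boosted stumps accuracy:", boosted.score(X_te, y_te))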

The importance of an analysis like this cannot be overstated. Heart disease remains one of the leading causes of mortality worldwide, and early detection is crucial for effective treatment and prevention. Developing a reliable and accurate predictive model can have significant implications for public health by enabling early detection and intervention, efficiently allocating medical resources, creating personalized treatment plans, advancing medical research, and reducing healthcare costs.

import pandas as pd
import numpy as np
import warnings

from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_score
from xgboost import XGBClassifier
from imblearn.pipeline import Pipeline as ImblearnPipeline
from imblearn.over_sampling import SMOTE

Read-In Data

# View a head of the data to find out about column naming issues, and missing values 
heart_2020 = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /archive (4) copy/2020/heart_2020_cleaned.csv")
heart_2022 = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /archive (4) copy/2022/heart_2022_no_nans.csv")

# We can see that 2020 had a well-prepared feature set, while 2022 includes some additional variables 
heart_2020.head(10)
HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth DiffWalking Sex AgeCategory Race Diabetic PhysicalActivity GenHealth SleepTime Asthma KidneyDisease SkinCancer
0 No 16.60 Yes No No 3.0 30.0 No Female 55-59 White Yes Yes Very good 5.0 Yes No Yes
1 No 20.34 No No Yes 0.0 0.0 No Female 80 or older White No Yes Very good 7.0 No No No
2 No 26.58 Yes No No 20.0 30.0 No Male 65-69 White Yes Yes Fair 8.0 Yes No No
3 No 24.21 No No No 0.0 0.0 No Female 75-79 White No No Good 6.0 No No Yes
4 No 23.71 No No No 28.0 0.0 Yes Female 40-44 White No Yes Very good 8.0 No No No
5 Yes 28.87 Yes No No 6.0 0.0 Yes Female 75-79 Black No No Fair 12.0 No No No
6 No 21.63 No No No 15.0 0.0 No Female 70-74 White No Yes Fair 4.0 Yes No Yes
7 No 31.64 Yes No No 5.0 0.0 Yes Female 80 or older White Yes No Good 9.0 Yes No No
8 No 26.45 No No No 0.0 0.0 No Female 80 or older White No, borderline diabetes No Fair 5.0 No Yes No
9 No 40.69 No No No 0.0 0.0 Yes Male 65-69 White No Yes Good 10.0 No No No

Data Preparation

  1. Cleaning 2020 data

We began the data preparation process by thoroughly cleaning the 2020 data. This step was essential to ensure the accuracy and reliability of our machine learning model.

  2. Renaming Variables within the 2022 set

Next, we renamed variables within the 2022 dataset to maintain consistency across different years. This renaming aligns the datasets and simplifies further analysis.

  3. Drop Additional Columns from 2022 to ensure matching feature sets

To ensure matching feature sets between the two datasets, we dropped the additional columns from the 2022 data. This step was crucial for creating a unified dataset for our model.

  4. Feature Selection

Finally, we engineered interaction features and selected the predictor set carried into the boosting models.

Cleaning Heart 2020 Data

# For the Diabetic column, there are cases with extra information beyond "Yes" and "No", and we want consistent binary values
diabetes_value_counts = heart_2020['Diabetic'].value_counts()
diabetes_value_counts

# Define a dictionary for the changes being made
changes_mapped = {
    'No, borderline diabetes': 'No',
    'Yes (during pregnancy)': 'Yes'
}

# Apply the replacements to the 'Diabetic' column
heart_2020['Diabetic'] = heart_2020['Diabetic'].replace(changes_mapped)
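As a quick sanity check (not part of the original notebook), the column should now contain only the two binary levels:

# Verify that only "Yes" and "No" remain in the Diabetic column after the mapping
print(heart_2020['Diabetic'].value_counts())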

Column Renaming

# Before working with these dataframes, I wanted to ensure the columns and their names matched
heart_2022_copy = heart_2022.drop(['State', 'LastCheckupTime', 'RemovedTeeth', 'TetanusLast10Tdap',
                                   'HeightInMeters', 'WeightInKilograms', 'HIVTesting', 'FluVaxLast12',
                                   'PneumoVaxEver', 'HighRiskLastYear', 'CovidPos', 'HadAngina', 'HadCOPD',
                                   'HadDepressiveDisorder', 'DifficultyDressingBathing', 'DifficultyErrands',
                                   'BlindOrVisionDifficulty', 'DifficultyConcentrating', 'DeafOrHardOfHearing',
                                   'ECigaretteUsage', 'HadArthritis', 'ChestScan'], axis=1)

heart_2022_copy

# Rename columns to be the same as heart 2020 dataset, for usability and easier understanding of results 
heart_2022_copy.rename(columns={'GeneralHealth': 'GenHealth',
                               'PhysicalHealthDays': 'PhysicalHealth',
                               'MentalHealthDays': 'MentalHealth',
                               'SleepHours': 'SleepTime',
                               'PhysicalActivities':'PhysicalActivity',
                                'HadStroke':'Stroke',
                               'HadAsthma':'Asthma',
                                'HadSkinCancer':'SkinCancer',
                               'HadKidneyDisease': 'KidneyDisease',
                               'HadDiabetes':'Diabetic',
                               'DifficultyWalking':'DiffWalking',
                               'SmokerStatus':'Smoking',
                               'AlcoholDrinkers':'AlcoholDrinking'}, inplace=True)

Variable Cleaning

In terms of variable recoding:

  1. Re-classifying the Race Variable: We re-classified the race variable to be less granular, collapsing the detailed race/ethnicity labels (e.g., "White only, Non-Hispanic") into broader categories, as we aimed to minimize its impact and promote a more equitable analysis.
  2. Evaluating the Smoking Status: Initially, we considered dummifying the smoking status to "Yes" or "No." However, after inspecting the data (a quick check is sketched below), we decided that keeping the multi-level smoking status provided more valuable information to our model, so we left it unchanged.
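For the second point, a quick look at the renamed Smoking column (a small sketch, assuming heart_2022_copy from the renaming step above) shows the multi-level smoking statuses we chose to keep:

# Inspect the smoking-status levels retained in the 2022 data
print(heart_2022_copy['Smoking'].value_counts())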
# Build a function that uses simple substring matching to map the detailed race/ethnicity labels
# onto broader categories, removing extra information that is not necessarily needed for our analysis

def classify_race(df):
   
    df['Race'] = df['RaceEthnicityCategory'].apply(
        lambda x: 'White' if 'White only' in x 
                  else ('Black' if 'Black only' in x 
                        else ('Multiracial' if 'Multiracial' in x 
                              else ('Other' if 'Other race' in x else x)))
    )
    return df

# Apply the function to our Heart 2022 dataframe
heart_2022_copy = classify_race(heart_2022_copy)

# Inspect the value counts of each race to ensure our function correctly applied the classification
original_race_counts = heart_2022_copy['RaceEthnicityCategory'].value_counts()
print(original_race_counts)

# Viewing the new and original race counts, we can see that the totals match, meaning the function worked
new_race_counts = heart_2022_copy['Race'].value_counts()
print('\n',new_race_counts)

# Drop the RaceEthnicityCategory column from the df as its data is now in the Race column
heart_2022_final = heart_2022_copy.drop(["RaceEthnicityCategory"], axis = 1)
White only, Non-Hispanic         186336
Hispanic                          22570
Black only, Non-Hispanic          19330
Other race only, Non-Hispanic     12205
Multiracial, Non-Hispanic          5581
Name: RaceEthnicityCategory, dtype: int64

 White          186336
Hispanic        22570
Black           19330
Other           12205
Multiracial      5581
Name: Race, dtype: int64

Training the Model: Heart Disease Data 2020

Here, “training” refers to the initial feature selection process, as we performed model validation (training and testing) on two models using the 2020 data: XGBoost and an AdaBoosted decision tree. Initially, I considered training solely on the 2020 data and using 2022 as a test set. However, I chose to first investigate the 2020 models’ test error before applying the best model to completely unseen data from 2022. This approach ensures the model’s relevance to current data: if certain factors are no longer significant in 2022, I can account for this within the 2022 model and note which factors have lost their importance.

heart_2020.head(5)
HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth DiffWalking Sex AgeCategory Race Diabetic PhysicalActivity GenHealth SleepTime Asthma KidneyDisease SkinCancer
0 No 16.60 Yes No No 3.0 30.0 No Female 55-59 White Yes Yes Very good 5.0 Yes No Yes
1 No 20.34 No No Yes 0.0 0.0 No Female 80 or older White No Yes Very good 7.0 No No No
2 No 26.58 Yes No No 20.0 30.0 No Male 65-69 White Yes Yes Fair 8.0 Yes No No
3 No 24.21 No No No 0.0 0.0 No Female 75-79 White No No Good 6.0 No No Yes
4 No 23.71 No No No 28.0 0.0 Yes Female 40-44 White No Yes Very good 8.0 No No No
# Create a new variable for the interaction between BMI * sleeping hours, and PhysicalHealth Days * MentalHealthDays 
heart_2020['BMI_SleepHrs'] = heart_2020['BMI'] * heart_2020['SleepTime']
heart_2020['Days_Off'] = heart_2020['PhysicalHealth'] * heart_2020['MentalHealth']

heart_2020['HeartDisease'] = heart_2020['HeartDisease'].astype(str)

# Ensure proper datatypes before modeling, Numerical Variables = Physical Health, Mental Health, SleepTime, BMI
print(heart_2020.dtypes)

# All categorical variables are objects, and numerical are numbers or floats
HeartDisease         object
BMI                 float64
Smoking              object
AlcoholDrinking      object
Stroke               object
PhysicalHealth      float64
MentalHealth        float64
DiffWalking          object
Sex                  object
AgeCategory          object
Race                 object
Diabetic             object
PhysicalActivity     object
GenHealth            object
SleepTime           float64
Asthma               object
KidneyDisease        object
SkinCancer           object
BMI_SleepHrs        float64
Days_Off            float64
dtype: object

Apply an XGBoost Tree Model

# Define X and Y sets, using train_test_split to save 30% for test and use 70% of the data to fit our model

# Initialize label encoder to be able to encode the HeartDisease column
label_encoder = LabelEncoder()

# Apply label encoder to the 'HeartDisease' column to ensure XGBoost handles it 
heart_2020['HeartDisease'] = label_encoder.fit_transform(heart_2020['HeartDisease'])

# Define our feature set
X = heart_2020.drop(["HeartDisease","Race"], axis = 1)
y = heart_2020["HeartDisease"]

# Use stratified splitting to ensure y has a representative % of the target class 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify = y)

# Model 1 - Dummify everything needed and scale numerical columns, All Predictors
ct_all = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output = False, handle_unknown='ignore'),
    make_column_selector(dtype_include=object)),
    ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))  #keep int64 as is
  ],
  remainder = "passthrough" 
).set_output(transform = "pandas")

# Use target class distribution to determine values to tune for pos_weight (which takes into account severe class imbalances)
target_class_distribution = heart_2020['HeartDisease'].value_counts() # 0 - 292422  
                                                                      # 1 - 27373  

# Define a grid of possible values that are close to 292,422 / 27,373 = 10.68
pos_weight_values = [10.6,10.7]
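# Sketch (not part of the original tuning grid): the same ratio can be derived programmatically
# rather than hard-coded; after label encoding, class 0 ("No") is the majority class
imbalance_ratio = target_class_distribution.loc[0] / target_class_distribution.loc[1]
print(f"Negative-to-positive ratio: {imbalance_ratio:.2f}")   # ~10.68 for the 2020 data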
    
    
# Setup pipeline #1 for XGBoost
xgb_pipeline = Pipeline([
    ("preprocessor", ct_all),
    ("classifier", XGBClassifier(use_label_encoder=False, eval_metric='logloss'))  # Parameters needed to suppress warnings
])

# Parameter grid for XGBoost with different parameters being tuned
param_grid_xgb = {
    'classifier__n_estimators': [100],            # Number of boosted trees, reduced from 300 to 100 to run 
    'classifier__learning_rate': [0.1, 0.5],    # Boosting learning rate (xgb’s “eta”)
    'classifier__max_depth': [3,4],                  # Maximum depth of a tree (used 3,4 to have code run)
    'classifier__scale_pos_weight': pos_weight_values 
}

# Setup GridSearchCV for XGBoost; n_jobs=-1 uses all available CPU cores for the grid search
grid_search_xgb = GridSearchCV(xgb_pipeline, param_grid_xgb, cv=5, scoring='precision', n_jobs=-1)

# Fitting our model on training data
grid_search_xgb.fit(X_train, y_train)

# Predict on the test set
test_xgb_predictions = grid_search_xgb.predict(X_test)

# Precision: of the patients we predicted to have heart disease, how many actually do
test_xgb_precision = precision_score(y_test, test_xgb_predictions)

print("Test XGBoost Precision: ", test_xgb_precision)
Test XGBoost Precision:  0.2189363063760989
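Precision alone does not show how many true cases the model misses, so a quick follow-up check (a sketch using the same test predictions from above) is to print the confusion matrix and the per-class report:

from sklearn.metrics import confusion_matrix, classification_report

# Rows are actual classes, columns are predicted classes (0 = No, 1 = Yes after label encoding)
print(confusion_matrix(y_test, test_xgb_predictions))
print(classification_report(y_test, test_xgb_predictions, target_names=["No", "Yes"]))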

Apply an AdaBoost Model (Decision Tree as Base Model)

# Define feature set

X = heart_2020.drop(["HeartDisease"], axis=1)
y = heart_2020["HeartDisease"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

# Model 1 - Dummify everything needed and scale numerical columns, All Predictors
ct_all = ColumnTransformer(
    [("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
      make_column_selector(dtype_include=object)),
     ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))],
    remainder="passthrough"
)

# Setup pipeline for AdaBoost with decision tree as the base estimator
ada_pipeline = ImblearnPipeline([
    ("preprocessor", ct_all),
    ("smote", SMOTE(random_state=42)),
    ("classifier", AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), random_state=2, algorithm='SAMME'))
])

# Parameter grid for AdaBoost with different parameters being tuned
param_grid_ada = {
    'classifier__n_estimators': [100],  # Number of boosted trees, set to 100 to run
    'classifier__learning_rate': [0.1]      # Boosting learning rate (set to 0.1 to run)
}

# Setup GridSearchCV for AdaBoost
grid_search_ada = GridSearchCV(ada_pipeline, param_grid_ada, cv=5, scoring='precision', n_jobs=-1)

# Fitting our model on training data
grid_search_ada.fit(X_train, y_train)

# Predict on the test set
test_ada_predictions = grid_search_ada.predict(X_test)

# Precision: of the patients we predicted to have heart disease (coded as 1), how many actually do
test_ada_precision = precision_score(y_test, test_ada_predictions, pos_label = 1)

print("Test AdaBoost Precision: ", test_ada_precision)
Test AdaBoost Precision:  0.25
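Before carrying this pipeline forward, it is worth confirming which grid combination was selected and its mean cross-validated precision (a quick check, not part of the original workflow):

# Winning hyperparameters and cross-validated precision from the AdaBoost grid search
print("Best AdaBoost parameters:", grid_search_ada.best_params_)
print("Best CV precision:", grid_search_ada.best_score_)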

Testing the Best Model: Updated Heart Attack Data 2022

heart_2022_final.head(5)
Sex GenHealth PhysicalHealth MentalHealth PhysicalActivity SleepTime HadHeartAttack Stroke Asthma SkinCancer KidneyDisease Diabetic DiffWalking Smoking AgeCategory BMI AlcoholDrinking Race
0 Female Very good 4.0 0.0 Yes 9.0 No No No No No No No Former smoker Age 65 to 69 27.99 No White
1 Male Very good 0.0 0.0 Yes 6.0 No No No No No Yes No Former smoker Age 70 to 74 30.13 No White
2 Male Very good 0.0 0.0 No 8.0 No No No No No No Yes Former smoker Age 75 to 79 31.66 Yes White
3 Female Fair 5.0 0.0 Yes 9.0 No No No Yes No No Yes Never smoked Age 80 or older 31.32 No White
4 Female Good 3.0 15.0 Yes 5.0 No No No No No No No Never smoked Age 80 or older 33.07 No White

Now that we have a well-defined feature set from 2020 that performs reasonably well in predicting the imbalanced target variable, “Heart Disease”, we will re-evaluate these features, our “best model”, on an updated slice of heart disease data from 2022. This will allow us to test whether the predictor variables associated with a rise or fall in heart disease risk still have an effect two years later, following the major disruptions of Covid-19. To keep the data consistent and allow an ad-hoc causal comparison, I chose to include only the same 19 predictors from our best boosting model for 2020, ensuring that any changes are likely due to societal effects rather than the exclusion or inclusion of a variable.

# Create a new variable for the interaction between BMI * sleeping hours, and PhysicalHealth Days * MentalHealthDays 
heart_2022_final['BMI_SleepHrs'] = heart_2022_final['BMI'] * heart_2022_final['SleepTime']
heart_2022_final['Days_Off'] = heart_2022_final['PhysicalHealth'] * heart_2022_final['MentalHealth']

heart_2022_final['HadHeartAttack'] = heart_2022_final['HadHeartAttack'].astype(str)

# Ensure proper datatypes before modeling, Numerical Variables = Physical Health, Mental Health, SleepTime, BMI
print(heart_2022_final.dtypes)
Sex                  object
GenHealth            object
PhysicalHealth      float64
MentalHealth        float64
PhysicalActivity     object
SleepTime           float64
HadHeartAttack       object
Stroke               object
Asthma               object
SkinCancer           object
KidneyDisease        object
Diabetic             object
DiffWalking          object
Smoking              object
AgeCategory          object
BMI                 float64
AlcoholDrinking      object
Race                 object
BMI_SleepHrs        float64
Days_Off            float64
dtype: object

Final Best Model - AdaBoosted Decision Tree

# Initialize label encoder to be able to encode the HadHeartAttack column
label_encoder = LabelEncoder()

# Apply label encoder to the 'HadHeartAttack' column so the target is a numeric 0/1 for the model
heart_2022_final['HadHeartAttack'] = label_encoder.fit_transform(heart_2022_final['HadHeartAttack'])

# Define the new X test and Y sets, denoted as X_2 and y_2 to avoid confusion
X_2 = heart_2022_final.drop(["HadHeartAttack","Race"], axis=1)
y_2 = heart_2022_final["HadHeartAttack"]

# Use stratified splitting to ensure y has a representative % of the target class
X2_train, X2_test, y2_train, y2_test = train_test_split(X_2, y_2, test_size=0.3, stratify=y_2, random_state = 2)

# Model 1 - Dummify everything needed and scale numerical columns, All Predictors
ct_all = ColumnTransformer(
    [("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
      make_column_selector(dtype_include=object)),
     ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))],
    remainder="passthrough"
)

# Setup pipeline for AdaBoost with decision tree as the base estimator
ada_pipeline = ImblearnPipeline([
    ("preprocessor", ct_all),
    ("smote", SMOTE(random_state=42)),
    ("classifier", AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), random_state=2, algorithm='SAMME'))
])

# Parameter grid for AdaBoost with different parameters being tuned
param_grid_ada = {
    'classifier__n_estimators': [100],  # Number of boosted trees, set to 100 to run
    'classifier__learning_rate': [0.1]      # Boosting learning rate (set to 0.1 to run)
}

# Setup GridSearchCV for AdaBoost
grid_search_ada = GridSearchCV(ada_pipeline, param_grid_ada, cv=5, scoring='precision', n_jobs=-1)

# Fitting our model on training data
grid_search_ada.fit(X2_train, y2_train)

# Predict on the test set
test_ada_predictions = grid_search_ada.predict(X2_test)

# Precision: of the patients we predicted to have heart disease (coded as 1), how many actually do
test_ada_precision = precision_score(y2_test, test_ada_predictions, pos_label = 1)

print("Test AdaBoost 2022 Precision: ", test_ada_precision)
Test AdaBoost 2022 Precision:  0.16891402058866173

Metric Selection: Precision

For this context, the performance metric could be chosen from a range of options, from accuracy to F1-score to more targeted measures like precision. With a severely imbalanced dataset, accuracy and even F1-score are misleading: a model that simply predicts that no patient has heart disease would still score roughly 91% accuracy given the class counts above, while being clinically useless. Precision, in this case, indicates how many of the patients the model flags as having heart disease actually have it, i.e., how often a predicted “Yes” is in fact a true case. With real-life data predicting an event that is rare but crucial to catch when it does occur, we wish to maximize precision.

A test precision of 25% means that only about 1 in 4 of the patients our model flags as at risk for heart disease actually have it; however, we are willing to over-flag people as being at risk when they may not be. The other side of this error would be to misclassify truly at-risk patients as healthy instead of flagging them to come in for a consultation. The potential consequences of that mistake are far more serious than the mental scare of being checked out for heart risk while healthy, which only leads to wasted time and potentially a minor visit cost. If someone is misclassified by our model as healthy when they are not, it could result not only in a death but also in potential legal complications. For these reasons, I selected precision as the final validation metric.
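To make the distinction concrete, precision can be computed by hand from the confusion matrix as TP / (TP + FP); the sketch below (assuming the 2022 test predictions generated above are still in memory) also prints recall, which captures how many of the truly at-risk patients we actually flag.

from sklearn.metrics import confusion_matrix

# For a binary confusion matrix, ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y2_test, test_ada_predictions).ravel()
print("Precision (TP / (TP + FP)):", tp / (tp + fp))
print("Recall    (TP / (TP + FN)):", tp / (tp + fn))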

Heart Disease Results

In both 2020 and 2022 we were able to build a model capable of somewhat predicting heart disease, with a final precision of 16.89% in 2022; heart disease is extremely difficult to detect from survey data because positive cases are such an anomaly. Extreme gradient boosting reached roughly 22% precision, while AdaBoost on a decision tree base produced the best result, improving precision by about 3 percentage points to 25% on the 2020 data. Many models would simply predict “No” for every observation, making them useless in practice. While it is difficult to determine the exact effect of each variable on heart disease, it is still useful to have a set of factors known to affect heart health. Common sense suggests that smokers are more likely to have heart complications, while those who participate in physical activity have a lower chance. Age category has a large effect, with older individuals typically at higher risk, and there is a relationship with the individual’s sex, although its direction is unclear from this analysis.

In terms of some variables that had minimal predictive impact in both 2020 and 2022, when removing “Race”, this resulted in an improvement in our model’s predictions, indicating that Race is not a factor majorly contributing to a person’s heart health which could have been expected as there shouldn’t be biological differences in heart health, but rather other demographic factors that contribute to a person’s health. On the other hand, when removing the predictors like Asthma, Diabetic, and Physical Activity, the predictability of our model decreased significantly, meaning that these factors have a profound impact on Health Disease, with a positive relationship for Asthma/Diabetes and negative for those participating in Physical Activity. However, it is worth noting that our predictive ability decreases from 2020 to 2022, from 25% to around 14.6% indicating that there may be some new factors attributing to having a heart attack that were not included in 2020, like having Covid-19. As a next step, we could look into these remaining variables to train a model, in hopes of clearly discovering which one of these factors is the culprit for this change.