Utilizing Bagging & Stacking Ensembles to Predict Sleep Score

Author

Riley Svensson

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import (
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from plotnine import *

1 Introduction

With the proliferation of wearable technology, devices like FitBit have become integral in monitoring various aspects of our health, including sleep patterns. One of the key metrics provided by FitBit is the Sleep Score, which offers insights into the quality of sleep based on several parameters. However, the exact formula used by FitBit to calculate this score is proprietary. This lab aims to reverse-engineer the Sleep Score calculation by utilizing various machine learning techniques.

The goal is to predict Sleep Score using other metrics available in the dataset, such as hours of sleep, deep sleep percentage, REM sleep percentage, and heart rate below resting. By applying bagging and stacking ensemble methods, we aim to create a model that can closely approximate the Sleep Score as calculated by FitBit. Bagging, short for Bootstrap Aggregating, involves creating multiple training sets from the original dataset, building separate models for each set, and then averaging their predictions to produce a single, low-variance model. Stacking, on the other hand, is a technique where multiple model types are trained on the same dataset, and their predictions are then used as inputs to a second-level model known as the Meta-Model. This model learns how to best combine the predictions of the base models, resulting in improved accuracy and robustness.
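
To make the bagging idea concrete, here is a minimal hand-rolled sketch. It assumes synthetic data (not the sleep dataset) and uses decision trees as the base learner; the helper name `bagged_predict` is made up for illustration. Each tree is fit on a bootstrap sample (rows drawn with replacement), and the final prediction is the average across trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the sleep dataset): a noisy sine curve
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)
X_new = np.linspace(0, 10, 50).reshape(-1, 1)

def bagged_predict(X, y, X_new, n_models=50):
    """Fit one tree per bootstrap sample and average their predictions."""
    n = len(X)
    all_preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # draw n row indices with replacement
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        all_preds.append(tree.predict(X_new))
    return np.mean(all_preds, axis=0)  # aggregate by averaging

bagged = bagged_predict(X_demo, y_demo, X_new)
single = DecisionTreeRegressor().fit(X_demo, y_demo).predict(X_new)
```

Plotting `bagged` and `single` against `X_new` would typically show the averaged curve to be smoother than the single tree, which is the variance reduction described above. `BaggingRegressor` in scikit-learn packages the same resample-fit-average loop.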

1.1 Importing the Sleep data

# Preview the first rows to check for column-naming issues and missing values
test = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /November Sleep Data - Sheet1.csv")
test.head(10)
    NOVEMBER   DATE        SLEEP SCORE  HOURS OF SLEEP  REM SLEEP  DEEP SLEEP  HEART RATE BELOW RESTING  SLEEP TIME
0   NaN        NaN         NaN          NaN             NaN        NaN         NaN                       NaN
1   Monday     11/1/2021   88.0         8:06:00         20.00%     13.00%      84.00%                    10:41pm - 7:54am
2   Tuesday    11/2/2021   83.0         7:57:00         12.00%     18.00%      90.00%                    10:40pm - 7:55am
3   Wednesday  11/3/2021   81.0         7:06:00         13.00%     22.00%      93.00%                    11:03pm - 7:16am
4   Thursday   11/4/2021   86.0         7:04:00         19.00%     17.00%      97.00%                    10:55pm - 6:56am
5   Friday     11/5/2021   81.0         9:24:00         17.00%     15.00%      66.00%                    10:14pm - 9:01am
6   Saturday   11/6/2021   77.0         8:19:00         14.00%     13.00%      21.00%                    11:21 - 8:45am
7   Sunday     11/7/2021   86.0         7:37:00         20.00%     15.00%      80.00%                    11:18pm - 6:57am
8   Monday     11/8/2021   81.0         6:46            16%        17.00%      89.00%                    11:50pm - 7:40am
9   Tuesday    11/9/2021   88.0         7:10:00         24.00%     19.00%      98.00%                    11:14pm - 7:16am

1.2 Clean & Concatenate Data

nov = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /November Sleep Data - Sheet1.csv")
dec = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /December Sleep data - Sheet1.csv")
jan = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /January sleep data - Sheet1.csv")
feb = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /February sleep data - Sheet1 (1).csv")
mar = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /March sleep data - Sheet1.csv")
apr = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /April sleep data - Sheet1.csv")

# Rename Column with Month name to appropriate name Weekday, as this is what the column holds
apr.rename(columns={'APRIL': 'WEEKDAY'}, inplace=True)
nov.rename(columns={'NOVEMBER': 'WEEKDAY'}, inplace=True)
dec.rename(columns={'DECEMBER': 'WEEKDAY'}, inplace=True)
jan.rename(columns={'JANUARY': 'WEEKDAY'}, inplace=True)
feb.rename(columns={'FEBEUARY': 'WEEKDAY'}, inplace=True)
mar.rename(columns={'MARCH': 'WEEKDAY'}, inplace=True)

# Fix naming of columns to have consistency across all 6 files 
jan.rename(columns={'HEART RATE UNDER RESTING': 'HEART RATE BELOW RESTING'}, inplace=True)
feb.rename(columns={'SLEEP SQORE': 'SLEEP SCORE'}, inplace=True)
mar.rename(columns={'HEARTRATE BELOW RESTING': 'HEART RATE BELOW RESTING'}, inplace=True)


# Stack these df's to create a complete dataset
sleep_data_total = pd.concat([nov, dec, jan, feb, mar, apr], ignore_index=True)

# Drop rows with any missing values
sleep_data = sleep_data_total.dropna()
sleep_data.head(10)
    WEEKDAY    DATE        SLEEP SCORE  HOURS OF SLEEP  REM SLEEP  DEEP SLEEP  HEART RATE BELOW RESTING  SLEEP TIME
1   Monday     11/1/2021   88.0         8:06:00         20.00%     13.00%      84.00%                    10:41pm - 7:54am
2   Tuesday    11/2/2021   83.0         7:57:00         12.00%     18.00%      90.00%                    10:40pm - 7:55am
3   Wednesday  11/3/2021   81.0         7:06:00         13.00%     22.00%      93.00%                    11:03pm - 7:16am
4   Thursday   11/4/2021   86.0         7:04:00         19.00%     17.00%      97.00%                    10:55pm - 6:56am
5   Friday     11/5/2021   81.0         9:24:00         17.00%     15.00%      66.00%                    10:14pm - 9:01am
6   Saturday   11/6/2021   77.0         8:19:00         14.00%     13.00%      21.00%                    11:21 - 8:45am
7   Sunday     11/7/2021   86.0         7:37:00         20.00%     15.00%      80.00%                    11:18pm - 6:57am
8   Monday     11/8/2021   81.0         6:46            16%        17.00%      89.00%                    11:50pm - 7:40am
9   Tuesday    11/9/2021   88.0         7:10:00         24.00%     19.00%      98.00%                    11:14pm - 7:16am
10  Wednesday  11/10/2021  84.0         7:16:00         21.00%     17.00%      96.00%                    11:16pm - 7:29am

1.3 Data Wrangling & Feature Preparation

First, I cleaned and prepared the dataset by dropping rows with missing values and splitting the SLEEP TIME column into Sleep Start and Sleep End times. I then converted HOURS OF SLEEP from h:mm or h:mm:ss strings into fractional hours, and converted the percentage values for deep sleep, REM sleep, and heart rate below resting into decimals. I engineered interaction features (hours of REM sleep, hours of deep sleep, and hours with heart rate below resting) by multiplying each of these fractions by the hours of sleep. To get accurate time calculations, I added missing AM/PM markers, parsed the start and end times, and computed the time between them (Total Sleep Time), accounting for nights that cross midnight. Dividing HOURS OF SLEEP by Total Sleep Time gives the Sleep Rate, a measure of sleep efficiency. Finally, I split the date into year, month, and day components to create additional features. Feature selection was iterative: I refined the set repeatedly to improve predictive performance.

sleep_data_final = sleep_data.copy()

# Splitting the 'SLEEP TIME' column into 'Sleep Start' and 'Sleep End'
sleep_data_final[['Sleep Start', 'Sleep End']] = sleep_data_final['SLEEP TIME'].str.split(' - ', expand=True)

# Convert HOURS OF SLEEP to fractional hours; some entries are h:mm rather than h:mm:ss
sleep_data_final['HOURS OF SLEEP'] = sleep_data_final['HOURS OF SLEEP'].apply(
    lambda x: pd.to_timedelta(x if x.count(':') == 2 else x + ':00').total_seconds() / 3600  # seconds -> hours
)

# Converting Deep Sleep + REM to be decimals for calculation, strip % sign, make float type, and divide by 100
sleep_data_final['DEEP SLEEP'] = sleep_data_final['DEEP SLEEP'].str.rstrip('%').astype(float) / 100
sleep_data_final['REM SLEEP'] = sleep_data_final['REM SLEEP'].str.rstrip('%').astype(float) / 100
sleep_data_final['HEART RATE BELOW RESTING'] = sleep_data_final['HEART RATE BELOW RESTING'].str.rstrip('%').astype(float) / 100


# Interaction features: HOURS OF SLEEP multiplied by the REM, deep sleep, and heart-rate-below-resting fractions
sleep_data_final['Hours REM Sleep'] = sleep_data_final['HOURS OF SLEEP'] * sleep_data_final['REM SLEEP']
sleep_data_final['Hours Deep Sleep'] = sleep_data_final['HOURS OF SLEEP'] * sleep_data_final['DEEP SLEEP']
sleep_data_final['Hours HRBR'] = sleep_data_final['HOURS OF SLEEP'] * sleep_data_final['HEART RATE BELOW RESTING']

# Calculate total sleep time, so we can divide HRS of SLEEP by it to find % sleeping

def add_am_pm(time_str):
    # Check if 'AM' or 'PM' is already present
    if 'am' in time_str.lower() or 'pm' in time_str.lower():
        return time_str
    hour = int(time_str.split(':')[0])
    return time_str + ('PM' if 7 <= hour <= 11 else 'AM')

# Apply the function to correct the time format
sleep_data_final['Sleep Start'] = sleep_data_final['Sleep Start'].apply(add_am_pm)
sleep_data_final['Sleep End'] = sleep_data_final['Sleep End'].apply(add_am_pm)

# Now attempt to convert to datetime
sleep_data_final['Sleep Start'] = pd.to_datetime(sleep_data_final['Sleep Start'], format='%I:%M%p', errors='coerce')
sleep_data_final['Sleep End'] = pd.to_datetime(sleep_data_final['Sleep End'], format='%I:%M%p', errors='coerce')

# Function to calculate duration considering possible midnight crossover
def calculate_duration(row):
    start = row['Sleep Start']
    end = row['Sleep End']
    if pd.isnull(start) or pd.isnull(end):
        return None
    if end < start:
        end += pd.Timedelta(days=1)
    return (end - start).total_seconds() / 3600

# Apply duration calculation to get Total Sleep Time
sleep_data_final['Total Sleep Time'] = sleep_data_final.apply(calculate_duration, axis=1)

# SLEEP TIME, Sleep Start, and Sleep End are only needed to compute Total Sleep Time;
# they are dropped in section 1.4 when `features` is created

# Calculate Rate of Sleeping (Hours of Sleep / Total Sleep Time)
sleep_data_final['Sleep Rate'] = sleep_data_final['HOURS OF SLEEP'] / sleep_data_final['Total Sleep Time']
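
As a quick sanity check of the conversions above, the following sketch reproduces the first cleaned row (Monday 11/1/2021) by hand. The helper names `hours_asleep` and `hours_in_bed` are illustrative only and are not part of the pipeline; they mirror the same logic on single strings.

```python
import pandas as pd

def hours_asleep(value):
    """Convert an 'h:mm' or 'h:mm:ss' string to fractional hours."""
    if value.count(':') == 1:
        value += ':00'  # pad h:mm to h:mm:ss
    return pd.to_timedelta(value).total_seconds() / 3600

def hours_in_bed(window):
    """Hours between start and end of a 'start - end' window, allowing a midnight crossover."""
    start_str, end_str = window.split(' - ')
    start = pd.to_datetime(start_str, format='%I:%M%p')
    end = pd.to_datetime(end_str, format='%I:%M%p')
    if end < start:
        end += pd.Timedelta(days=1)  # wake-up time falls on the next day
    return (end - start).total_seconds() / 3600

asleep = hours_asleep('8:06:00')            # 8 h 6 min
in_bed = hours_in_bed('10:41pm - 7:54am')   # 9 h 13 min
sleep_rate = asleep / in_bed
print(round(asleep, 6), round(in_bed, 6), round(sleep_rate, 6))
# 8.1 9.216667 0.878843
```

These values agree with the HOURS OF SLEEP, Total Sleep Time, and Sleep Rate shown for that date in the cleaned dataset in section 1.5.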

1.4 Potential Feature Selection

features = sleep_data_final.drop(['SLEEP TIME','Sleep Start','Sleep End'], axis = 1)

# Date should be converted from an object to Datetime, and split to YEAR, MONTH, DAY
features['DATE'] = pd.to_datetime(features['DATE'])

features['YEAR'] = features['DATE'].dt.year
features['MONTH'] = features['DATE'].dt.month
features['DAY'] = features['DATE'].dt.day

features_final = features.dropna()

1.5 View the Cleaned Dataset for Modeling

features_final.head(10)
    WEEKDAY    DATE        SLEEP SCORE  HOURS OF SLEEP  REM SLEEP  DEEP SLEEP  HEART RATE BELOW RESTING  Hours REM Sleep  Hours Deep Sleep  Hours HRBR  Total Sleep Time  Sleep Rate  YEAR  MONTH  DAY
1   Monday     2021-11-01  88.0         8.100000        0.20       0.13        0.84                      1.620000         1.053000          6.804000    9.216667          0.878843    2021  11     1
2   Tuesday    2021-11-02  83.0         7.950000        0.12       0.18        0.90                      0.954000         1.431000          7.155000    9.250000          0.859459    2021  11     2
3   Wednesday  2021-11-03  81.0         7.100000        0.13       0.22        0.93                      0.923000         1.562000          6.603000    8.216667          0.864097    2021  11     3
4   Thursday   2021-11-04  86.0         7.066667        0.19       0.17        0.97                      1.342667         1.201333          6.854667    8.016667          0.881497    2021  11     4
5   Friday     2021-11-05  81.0         9.400000        0.17       0.15        0.66                      1.598000         1.410000          6.204000    10.783333         0.871716    2021  11     5
6   Saturday   2021-11-06  77.0         8.316667        0.14       0.13        0.21                      1.164333         1.081167          1.746500    9.400000          0.884752    2021  11     6
7   Sunday     2021-11-07  86.0         7.616667        0.20       0.15        0.80                      1.523333         1.142500          6.093333    7.650000          0.995643    2021  11     7
8   Monday     2021-11-08  81.0         6.766667        0.16       0.17        0.89                      1.082667         1.150333          6.022333    7.833333          0.863830    2021  11     8
9   Tuesday    2021-11-09  88.0         7.166667        0.24       0.19        0.98                      1.720000         1.361667          7.023333    8.033333          0.892116    2021  11     9
10  Wednesday  2021-11-10  84.0         7.266667        0.21       0.17        0.96                      1.526000         1.235333          6.976000    8.216667          0.884381    2021  11     10


1.6 Feature Selection: Final

After iterating through potential features and refining my dataset, I defined the feature set (X) and target variable (y). The feature set included all relevant predictors except for the SLEEP SCORE, which was the target variable, and columns like DATE, DAY, WEEKDAY, and YEAR which were either redundant or not directly useful for prediction. Next, I split the data into training and testing sets to evaluate the model’s performance, using a 70-30 split, with 70% of the data used for training and 30% for testing.

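To prepare the data for modeling, I created a column transformer that handles both categorical and numerical features. Any categorical (object-typed) features are one-hot encoded, float features are standardized so they are on a comparable scale, and the remaining integer columns are passed through unchanged. Because WEEKDAY is dropped from X, the final feature set contains no categorical columns, so the one-hot step has nothing to encode here; MONTH, an integer column, is passed through unscaled.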

# Define X and Y sets

X = features_final.drop(["SLEEP SCORE", "DATE","DAY","WEEKDAY","YEAR"], axis = 1)
y = features_final["SLEEP SCORE"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state= 2)

# Model 1 - Dummify everything needed and scale numerical columns, All Predictors
ct_all = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output = False, handle_unknown='ignore'),
    make_column_selector(dtype_include=object)),
    ("standardize", StandardScaler(), make_column_selector(dtype_include=np.float64))  #keep int64 as is
  ],
  remainder = "passthrough" 
).set_output(transform = "pandas")


1.7 Define the Base Models

# Define base models within the preprocessing pipeline to apply consistent scaling/dummification

base_models = [
    ('knn', Pipeline([
        ('preprocessor', ct_all),
        ('regressor', KNeighborsRegressor(n_neighbors=15)) #experimented with different values 5,10,15,20
    ])),
    ('svr', Pipeline([
        ('preprocessor', ct_all),
        ('regressor', SVR(kernel='linear'))  # linear kernel performed best
    ])),
    ('bagging_tree', Pipeline([
        ('preprocessor', ct_all),
        ('regressor', BaggingRegressor(n_estimators=100, random_state=2))  #n_estimators was optimal at 100
    ])),
    ('random_forest', Pipeline([
        ('preprocessor', ct_all),
        ('regressor', RandomForestRegressor(n_estimators=100, random_state=2))  #n_estimators was optimal at 100
    ])),
    ('gradient_boosting', Pipeline([
        ('preprocessor', ct_all),
        ('regressor', GradientBoostingRegressor(n_estimators=100, random_state=2))  #n_estimators was optimal at 100
    ])),
    ('ridge', Pipeline([
        ('preprocessor', ct_all),
        ('regressor', Ridge(alpha=0.8))  # experimented with 0.5 - 1.0, + 0.1
    ]))
]


1.8 Model Results: Mean Absolute Error

# Define the final estimator (meta-model); the target is continuous, so Linear Regression is used
final_regressor = LinearRegression()

# Create the stacking ensemble
stacking_regressor = StackingRegressor(estimators=base_models, final_estimator=final_regressor, cv=5)

# Fit the stacking model on the training data
stacking_regressor.fit(X_train, y_train)

# Predict/evaluate the model on the Test data
stacking_predictions = stacking_regressor.predict(X_test)
stacking_mae = mean_absolute_error(y_test, stacking_predictions)

# Stacking model performance 
print("Stacking Model MAE: ", stacking_mae)
Stacking Model MAE:  1.3524256168789923
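
To illustrate what the stacking step does, the sketch below builds the meta-model by hand on synthetic data (not the sleep dataset) with only two base models. `StackingRegressor` trains its final estimator on cross-validated predictions of the base estimators; here `cross_val_predict` is called directly to produce those out-of-fold predictions. Names such as `X_demo`, `meta_features`, and `weights` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

# Synthetic stand-in data: a linear signal plus noise
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=150)

base = {
    "knn": KNeighborsRegressor(n_neighbors=15),
    "ridge": Ridge(alpha=0.8),
}

# One column of out-of-fold predictions per base model: each row is predicted
# by a model that did not see that row during fitting
meta_features = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5) for model in base.values()
])

# The meta-model learns how to weight the base models' predictions
meta_model = LinearRegression().fit(meta_features, y_demo)
weights = dict(zip(base, meta_model.coef_))
print(weights)
```

On a fitted `StackingRegressor` with a linear final estimator, the analogous weights are available as `final_estimator_.coef_`, one per base model, which gives a rough view of how much each base model contributes.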


1.9 Conclusion

By stacking several base models, including bagging-based ensembles (bagged trees and a random forest), I reduced the mean absolute error on the test set to 1.352. Since Sleep Scores typically range from 70 to 90, this is a small error: predictions are off by roughly 1.4 points on average. In terms of reverse-engineering the formula used by FitBit to calculate Sleep Score, I arrived at the following set of features that predict it accurately:

Sleep Score ~ Hours of Sleep + REM Sleep + Deep Sleep + HRBR + HRBR Hours + REM Hours + Deep Sleep Hours + Total Sleep Time + Sleep Rate + Month

To add predictive power, I created and transformed several variables: I split the date into DAY, MONTH, and YEAR columns (only MONTH proved useful) and converted the percentage variables to decimals so they are easy to compute with in Python. I then multiplied Hours of Sleep by these decimals to get the number of hours spent in each sleep type. Lastly, I noticed that the time between the start and end of sleep was longer than the recorded Hours of Sleep in every record, which led me to compute a Sleep Rate (Hours of Sleep divided by Total Sleep Time) as a measure of sleep efficiency. With these features, and after dropping redundant columns, my stacking model, with Linear Regression as the final estimator and Ridge, SVR, Gradient Boosting, Bagging, Random Forest, and KNN as base models, predicted Sleep Score with a test MAE of about 1.35.
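
One way to put an MAE in context is to compare it with a naive baseline that always predicts the training mean. The sketch below shows the mechanics using only the ten Sleep Scores displayed in the data preview (first eight as "training", last two as "test"). This split is purely illustrative, so the resulting number is not comparable to the 1.352 reported above; on the real data one would fit the baseline on `X_train, y_train` and score it on `X_test, y_test`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Sleep Scores from the ten preview rows; the 8/2 split is illustrative only
y_train_demo = np.array([88, 83, 81, 86, 81, 77, 86, 81], dtype=float)
y_test_demo = np.array([88, 84], dtype=float)

# DummyRegressor ignores the features, so placeholder columns are enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

baseline = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
baseline_mae = mean_absolute_error(y_test_demo, baseline.predict(X_test_demo))
print(baseline_mae)  # 3.125 (training mean is 82.875)
```

A stacked model is only worth its complexity if its test MAE is clearly below the baseline computed on the same split.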