Utilizing Bagging & Stacking Ensembles to Predict Sleep Score
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import (
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from plotnine import *
1 Introduction
With the proliferation of wearable technology, devices like the Fitbit have become integral to monitoring many aspects of our health, including sleep patterns. One of the key metrics Fitbit provides is the Sleep Score, which summarizes sleep quality based on several parameters. However, the exact formula Fitbit uses to calculate this score is proprietary. This lab aims to reverse-engineer the Sleep Score calculation using machine learning techniques.
The goal is to predict Sleep Score using other metrics available in the dataset, such as hours of sleep, deep sleep percentage, REM sleep percentage, and heart rate below resting. By applying bagging and stacking ensemble methods, we aim to create a model that closely approximates the Sleep Score as calculated by Fitbit. Bagging, short for Bootstrap Aggregating, draws multiple bootstrap samples (sampled with replacement) from the original training set, fits a separate model on each sample, and averages their predictions to produce a single, lower-variance model. Stacking, on the other hand, trains several different model types on the same dataset and feeds their predictions as inputs to a second-level model known as the meta-model. The meta-model learns how best to combine the base models' predictions, which can improve accuracy and robustness.
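The two ensemble ideas described above can be sketched on a small synthetic dataset. This is only an illustration under assumptions: the data comes from `make_regression` rather than the sleep data, and the model choices and hyperparameters here are placeholders, not the ones used later in this lab.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (NOT the sleep data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Bagging: 50 decision trees (the default base estimator), each fit on its own
# bootstrap sample of the training rows; their predictions are averaged.
bagging = BaggingRegressor(n_estimators=50, random_state=0)
bagging.fit(X_tr, y_tr)

# Stacking: different model types are fit on the same data, and a meta-model
# (here LinearRegression) is fit on their cross-validated predictions.
stacking = StackingRegressor(
    estimators=[
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("ridge", Ridge(alpha=1.0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,
)
stacking.fit(X_tr, y_tr)

bagging_pred = bagging.predict(X_te)
stacking_pred = stacking.predict(X_te)

# One meta-model coefficient per base model: how much each prediction is weighted
print(stacking.final_estimator_.coef_)
```

The meta-model's coefficients give a rough view of how heavily each base model's prediction is weighted in the final output.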
1.1 Importing the Sleep data
# View the head of the data to check for column-naming issues and missing values
test = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /November Sleep Data - Sheet1.csv")
test.head(10)
| | NOVEMBER | DATE | SLEEP SCORE | HOURS OF SLEEP | REM SLEEP | DEEP SLEEP | HEART RATE BELOW RESTING | SLEEP TIME |
---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Monday | 11/1/2021 | 88.0 | 8:06:00 | 20.00% | 13.00% | 84.00% | 10:41pm - 7:54am |
2 | Tuesday | 11/2/2021 | 83.0 | 7:57:00 | 12.00% | 18.00% | 90.00% | 10:40pm - 7:55am |
3 | Wednesday | 11/3/2021 | 81.0 | 7:06:00 | 13.00% | 22.00% | 93.00% | 11:03pm - 7:16am |
4 | Thursday | 11/4/2021 | 86.0 | 7:04:00 | 19.00% | 17.00% | 97.00% | 10:55pm - 6:56am |
5 | Friday | 11/5/2021 | 81.0 | 9:24:00 | 17.00% | 15.00% | 66.00% | 10:14pm - 9:01am |
6 | Saturday | 11/6/2021 | 77.0 | 8:19:00 | 14.00% | 13.00% | 21.00% | 11:21 - 8:45am |
7 | Sunday | 11/7/2021 | 86.0 | 7:37:00 | 20.00% | 15.00% | 80.00% | 11:18pm - 6:57am |
8 | Monday | 11/8/2021 | 81.0 | 6:46 | 16% | 17.00% | 89.00% | 11:50pm - 7:40am |
9 | Tuesday | 11/9/2021 | 88.0 | 7:10:00 | 24.00% | 19.00% | 98.00% | 11:14pm - 7:16am |
1.2 Clean & Concatenate Data
nov = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /November Sleep Data - Sheet1.csv")
dec = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /December Sleep data - Sheet1.csv")
jan = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /January sleep data - Sheet1.csv")
feb = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /February sleep data - Sheet1 (1).csv")
mar = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /March sleep data - Sheet1.csv")
apr = pd.read_csv("/Users/rileysvensson/Desktop/GSB 545 - Advanced Machine Learning/data /April sleep data - Sheet1.csv")

# Rename the column headed by the month name to WEEKDAY, since that is what it holds
apr.rename(columns={'APRIL': 'WEEKDAY'}, inplace=True)
nov.rename(columns={'NOVEMBER': 'WEEKDAY'}, inplace=True)
dec.rename(columns={'DECEMBER': 'WEEKDAY'}, inplace=True)
jan.rename(columns={'JANUARY': 'WEEKDAY'}, inplace=True)
feb.rename(columns={'FEBEUARY': 'WEEKDAY'}, inplace=True)
mar.rename(columns={'MARCH': 'WEEKDAY'}, inplace=True)

# Fix column names so they are consistent across all 6 files
jan.rename(columns={'HEART RATE UNDER RESTING': 'HEART RATE BELOW RESTING'}, inplace=True)
feb.rename(columns={'SLEEP SQORE': 'SLEEP SCORE'}, inplace=True)
mar.rename(columns={'HEARTRATE BELOW RESTING': 'HEART RATE BELOW RESTING'}, inplace=True)

# Stack these DataFrames to create a complete dataset
sleep_data_total = pd.concat([nov, dec, jan, feb, mar, apr], ignore_index=True)

# Drop rows with any missing values
sleep_data = sleep_data_total.dropna()
sleep_data.head(10)
| | WEEKDAY | DATE | SLEEP SCORE | HOURS OF SLEEP | REM SLEEP | DEEP SLEEP | HEART RATE BELOW RESTING | SLEEP TIME |
---|---|---|---|---|---|---|---|---|
1 | Monday | 11/1/2021 | 88.0 | 8:06:00 | 20.00% | 13.00% | 84.00% | 10:41pm - 7:54am |
2 | Tuesday | 11/2/2021 | 83.0 | 7:57:00 | 12.00% | 18.00% | 90.00% | 10:40pm - 7:55am |
3 | Wednesday | 11/3/2021 | 81.0 | 7:06:00 | 13.00% | 22.00% | 93.00% | 11:03pm - 7:16am |
4 | Thursday | 11/4/2021 | 86.0 | 7:04:00 | 19.00% | 17.00% | 97.00% | 10:55pm - 6:56am |
5 | Friday | 11/5/2021 | 81.0 | 9:24:00 | 17.00% | 15.00% | 66.00% | 10:14pm - 9:01am |
6 | Saturday | 11/6/2021 | 77.0 | 8:19:00 | 14.00% | 13.00% | 21.00% | 11:21 - 8:45am |
7 | Sunday | 11/7/2021 | 86.0 | 7:37:00 | 20.00% | 15.00% | 80.00% | 11:18pm - 6:57am |
8 | Monday | 11/8/2021 | 81.0 | 6:46 | 16% | 17.00% | 89.00% | 11:50pm - 7:40am |
9 | Tuesday | 11/9/2021 | 88.0 | 7:10:00 | 24.00% | 19.00% | 98.00% | 11:14pm - 7:16am |
10 | Wednesday | 11/10/2021 | 84.0 | 7:16:00 | 21.00% | 17.00% | 96.00% | 11:16pm - 7:29am |
1.3 Data Wrangling & Feature Preparation
First, I cleaned and prepared the dataset by handling missing values and splitting the SLEEP TIME column into Sleep Start and Sleep End times. I then converted the hours of sleep into a float number of hours and transformed the percentage values for deep sleep, REM sleep, and heart rate below resting into decimal form. I engineered new interaction features (hours of REM sleep, hours of deep sleep, and hours with heart rate below resting) to capture combined effects. To ensure accurate time calculations, I corrected the time formats and calculated the total time between sleep start and sleep end, which let me derive Sleep Rate, the share of that time actually spent asleep, as a measure of sleep efficiency. Finally, I split the date into year, month, and day components to create additional features. This feature selection process was iterative and was refined repeatedly to achieve the best predictive performance.
sleep_data_final = sleep_data.copy()

# Split the 'SLEEP TIME' column into 'Sleep Start' and 'Sleep End'
sleep_data_final[['Sleep Start', 'Sleep End']] = sleep_data_final['SLEEP TIME'].str.split(' - ', expand=True)

# Convert Hours of Sleep to float; some values are in hh:mm format only
sleep_data_final['HOURS OF SLEEP'] = sleep_data_final['HOURS OF SLEEP'].apply(
    lambda x: pd.to_timedelta(x if x.count(':') == 2 else x + ':00').total_seconds() / 3600  # seconds to hours
)

# Convert Deep Sleep, REM Sleep, and Heart Rate Below Resting to decimals: strip the % sign, cast to float, divide by 100
sleep_data_final['DEEP SLEEP'] = sleep_data_final['DEEP SLEEP'].str.rstrip('%').astype(float) / 100
sleep_data_final['REM SLEEP'] = sleep_data_final['REM SLEEP'].str.rstrip('%').astype(float) / 100
sleep_data_final['HEART RATE BELOW RESTING'] = sleep_data_final['HEART RATE BELOW RESTING'].str.rstrip('%').astype(float) / 100

# Create interactions between Hours of Sleep and each percentage (REM, Deep Sleep, Heart Rate Below Resting)
sleep_data_final['Hours REM Sleep'] = sleep_data_final['HOURS OF SLEEP'] * sleep_data_final['REM SLEEP']
sleep_data_final['Hours Deep Sleep'] = sleep_data_final['HOURS OF SLEEP'] * sleep_data_final['DEEP SLEEP']
sleep_data_final['Hours HRBR'] = sleep_data_final['HOURS OF SLEEP'] * sleep_data_final['HEART RATE BELOW RESTING']

# Calculate total sleep time, so we can divide Hours of Sleep by it to find the % of time spent sleeping
def add_am_pm(time_str):
    # Check if 'AM' or 'PM' is already present
    if 'am' in time_str.lower() or 'pm' in time_str.lower():
        return time_str
    hour = int(time_str.split(':')[0])
    return time_str + ('PM' if 7 <= hour <= 11 else 'AM')

# Apply the function to correct the time format
sleep_data_final['Sleep Start'] = sleep_data_final['Sleep Start'].apply(add_am_pm)
sleep_data_final['Sleep End'] = sleep_data_final['Sleep End'].apply(add_am_pm)

# Now attempt to convert to datetime
sleep_data_final['Sleep Start'] = pd.to_datetime(sleep_data_final['Sleep Start'], format='%I:%M%p', errors='coerce')
sleep_data_final['Sleep End'] = pd.to_datetime(sleep_data_final['Sleep End'], format='%I:%M%p', errors='coerce')

# Function to calculate duration, accounting for a possible midnight crossover
def calculate_duration(row):
    start = row['Sleep Start']
    end = row['Sleep End']
    if pd.isnull(start) or pd.isnull(end):
        return None
    if end < start:
        end += pd.Timedelta(days=1)
    return (end - start).total_seconds() / 3600

# Apply the duration calculation to get Total Sleep Time
sleep_data_final['Total Sleep Time'] = sleep_data_final.apply(calculate_duration, axis=1)

# Drop the helper columns used to calculate Total Sleep Time
# Note: the result is not assigned here, so these columns remain until the drop in the next section
sleep_data_final.drop(['SLEEP TIME','Sleep Start','Sleep End'], axis = 1)

# Calculate Rate of Sleeping (Hours of Sleep / Total Sleep Time)
sleep_data_final['Sleep Rate'] = sleep_data_final['HOURS OF SLEEP'] / sleep_data_final['Total Sleep Time']
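As a quick check of the time handling above, the sketch below reproduces the conversions for two rows of the November data shown earlier. The helper names `hours_from_clock`, `infer_meridiem`, and `overnight_duration` are hypothetical stand-ins written for this example; they mirror the logic of the cell above but operate on single strings instead of DataFrame columns.

```python
import pandas as pd

def hours_from_clock(text):
    """Convert 'h:mm' or 'h:mm:ss' into fractional hours."""
    if text.count(":") == 1:
        text = text + ":00"  # pad hh:mm to hh:mm:ss before parsing
    return pd.to_timedelta(text).total_seconds() / 3600

def infer_meridiem(time_str):
    """Append AM/PM when missing; hours 7-11 are assumed to be evening (PM)."""
    if "am" in time_str.lower() or "pm" in time_str.lower():
        return time_str
    hour = int(time_str.split(":")[0])
    return time_str + ("PM" if 7 <= hour <= 11 else "AM")

def overnight_duration(start_text, end_text):
    """Hours between two clock times, adding a day if the end is past midnight."""
    start = pd.to_datetime(infer_meridiem(start_text), format="%I:%M%p")
    end = pd.to_datetime(infer_meridiem(end_text), format="%I:%M%p")
    if end < start:
        end += pd.Timedelta(days=1)
    return (end - start).total_seconds() / 3600

# Row 1 of the November data: 8:06:00 asleep, in bed 10:41pm - 7:54am
hours_asleep = hours_from_clock("8:06:00")
time_in_bed = overnight_duration("10:41pm", "7:54am")
sleep_rate = hours_asleep / time_in_bed
print(hours_asleep, time_in_bed, sleep_rate)

# Row 6 has no am/pm on the start time: 11:21 - 8:45am
row6_in_bed = overnight_duration("11:21", "8:45am")
print(row6_in_bed)
```

Note that the 7-to-11 rule is a heuristic: a start time such as "12:30" without a suffix is treated as AM, which matches a bedtime just after midnight but would be wrong for a daytime nap.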
1.4 Potential Feature Selection
features = sleep_data_final.drop(['SLEEP TIME','Sleep Start','Sleep End'], axis = 1)

# DATE should be converted from object to datetime, then split into YEAR, MONTH, DAY
features['DATE'] = pd.to_datetime(features['DATE'])

features['YEAR'] = features['DATE'].dt.year
features['MONTH'] = features['DATE'].dt.month
features['DAY'] = features['DATE'].dt.day

features_final = features.dropna()
1.5 View the Cleaned Dataset for Modeling
features_final.head(10)
| | WEEKDAY | DATE | SLEEP SCORE | HOURS OF SLEEP | REM SLEEP | DEEP SLEEP | HEART RATE BELOW RESTING | Hours REM Sleep | Hours Deep Sleep | Hours HRBR | Total Sleep Time | Sleep Rate | YEAR | MONTH | DAY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Monday | 2021-11-01 | 88.0 | 8.100000 | 0.20 | 0.13 | 0.84 | 1.620000 | 1.053000 | 6.804000 | 9.216667 | 0.878843 | 2021 | 11 | 1 |
2 | Tuesday | 2021-11-02 | 83.0 | 7.950000 | 0.12 | 0.18 | 0.90 | 0.954000 | 1.431000 | 7.155000 | 9.250000 | 0.859459 | 2021 | 11 | 2 |
3 | Wednesday | 2021-11-03 | 81.0 | 7.100000 | 0.13 | 0.22 | 0.93 | 0.923000 | 1.562000 | 6.603000 | 8.216667 | 0.864097 | 2021 | 11 | 3 |
4 | Thursday | 2021-11-04 | 86.0 | 7.066667 | 0.19 | 0.17 | 0.97 | 1.342667 | 1.201333 | 6.854667 | 8.016667 | 0.881497 | 2021 | 11 | 4 |
5 | Friday | 2021-11-05 | 81.0 | 9.400000 | 0.17 | 0.15 | 0.66 | 1.598000 | 1.410000 | 6.204000 | 10.783333 | 0.871716 | 2021 | 11 | 5 |
6 | Saturday | 2021-11-06 | 77.0 | 8.316667 | 0.14 | 0.13 | 0.21 | 1.164333 | 1.081167 | 1.746500 | 9.400000 | 0.884752 | 2021 | 11 | 6 |
7 | Sunday | 2021-11-07 | 86.0 | 7.616667 | 0.20 | 0.15 | 0.80 | 1.523333 | 1.142500 | 6.093333 | 7.650000 | 0.995643 | 2021 | 11 | 7 |
8 | Monday | 2021-11-08 | 81.0 | 6.766667 | 0.16 | 0.17 | 0.89 | 1.082667 | 1.150333 | 6.022333 | 7.833333 | 0.863830 | 2021 | 11 | 8 |
9 | Tuesday | 2021-11-09 | 88.0 | 7.166667 | 0.24 | 0.19 | 0.98 | 1.720000 | 1.361667 | 7.023333 | 8.033333 | 0.892116 | 2021 | 11 | 9 |
10 | Wednesday | 2021-11-10 | 84.0 | 7.266667 | 0.21 | 0.17 | 0.96 | 1.526000 | 1.235333 | 6.976000 | 8.216667 | 0.884381 | 2021 | 11 | 10 |
1.6 Feature Selection: Final
After iterating through potential features and refining my dataset, I defined the feature set (X) and target variable (y). The feature set included all relevant predictors except SLEEP SCORE, which was the target variable, and columns like DATE, DAY, WEEKDAY, and YEAR, which were either redundant or not directly useful for prediction. Next, I split the data into training and testing sets to evaluate the model's performance, using 70% of the data for training and 30% for testing.
To prepare the data for modeling, I created a column transformer that handled both categorical and numerical features. Categorical features were one-hot encoded to convert them into a numerical format, while numerical features were standardized to ensure they were on a comparable scale.
# Define X and y sets
X = features_final.drop(["SLEEP SCORE", "DATE","DAY","WEEKDAY","YEAR"], axis = 1)
y = features_final["SLEEP SCORE"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state= 2)

# Model 1 - Dummify everything needed and scale numerical columns, All Predictors
ct_all = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output = False, handle_unknown='ignore'),
         make_column_selector(dtype_include=object)),
        ("standardize", StandardScaler(), make_column_selector(dtype_include=np.float64))  # keep int64 as is
    ],
    remainder = "passthrough"
).set_output(transform = "pandas")
1.7 Define the Base Models
# Define base models within the preprocessing pipeline to apply consistent scaling/dummification
base_models = [
    ('knn', Pipeline([
        ('preprocessor', ct_all),
        ('regressor', KNeighborsRegressor(n_neighbors=15))  # experimented with different values 5,10,15,20
    ])),
    ('svr', Pipeline([
        ('preprocessor', ct_all),
        ('regressor', SVR(kernel='linear'))  # linear kernel performed best
    ])),
    ('bagging_tree', Pipeline([
        ('preprocessor', ct_all),
        ('regressor', BaggingRegressor(n_estimators=100, random_state=2))  # n_estimators was optimal at 100
    ])),
    ('random_forest', Pipeline([
        ('preprocessor', ct_all),
        ('regressor', RandomForestRegressor(n_estimators=100, random_state=2))  # n_estimators was optimal at 100
    ])),
    ('gradient_boosting', Pipeline([
        ('preprocessor', ct_all),
        ('regressor', GradientBoostingRegressor(n_estimators=100, random_state=2))  # n_estimators was optimal at 100
    ])),
    ('ridge', Pipeline([
        ('preprocessor', ct_all),
        ('regressor', Ridge(alpha=0.8))  # experimented with 0.5 - 1.0, in steps of 0.1
    ]))
]
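The comments above mention experimenting with several hyperparameter values. The lab does not show how that search was run, so the following is only one possible approach, sketched with `GridSearchCV` on synthetic data. The data, the plain `StandardScaler` preprocessor, and the scoring choice are assumptions for this example; the grid of `n_neighbors` values is the one mentioned in the KNN comment.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the sleep data)
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=2)

knn_pipeline = Pipeline([
    ("preprocessor", StandardScaler()),  # stand-in for the column transformer
    ("regressor", KNeighborsRegressor()),
])

# Step names are joined to parameter names with a double underscore
search = GridSearchCV(
    knn_pipeline,
    param_grid={"regressor__n_neighbors": [5, 10, 15, 20]},
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, so MAE is negated
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["regressor__n_neighbors"]
best_mae = -search.best_score_  # flip the sign back to a positive MAE
print(best_k, best_mae)
```

Because the preprocessing lives inside the pipeline, it is refit on each training fold, which keeps information from the validation fold out of the scaling step.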
1.8 Model Results: Mean Absolute Error
# Define the final estimator (meta-model); Linear Regression makes sense for this regression task
final_regressor = LinearRegression()

# Create the stacking ensemble
stacking_regressor = StackingRegressor(estimators=base_models, final_estimator=final_regressor, cv=5)

# Fit the stacking model on the training data
stacking_regressor.fit(X_train, y_train)

# Predict and evaluate the model on the test data
stacking_predictions = stacking_regressor.predict(X_test)
stacking_mae = mean_absolute_error(y_test, stacking_predictions)

# Stacking model performance
print("Stacking Model MAE: ", stacking_mae)
Stacking Model MAE: 1.3524256168789923
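For context on what this number measures, mean absolute error is the average of the absolute differences between actual and predicted scores. The sketch below computes it by hand, checks it against scikit-learn, and compares it with a naive baseline that always predicts the mean. The predictions here are hypothetical values chosen for illustration and are not output from the stacking model.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Actual scores from the first five November rows; predictions are made up
y_true = np.array([88.0, 83.0, 81.0, 86.0, 81.0])
y_pred = np.array([86.5, 84.0, 82.0, 85.0, 80.0])  # hypothetical predictions

# MAE by hand: average of |actual - predicted|
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

# Naive baseline: always predict the mean of the observed scores
baseline = DummyRegressor(strategy="mean")
baseline.fit(np.zeros((len(y_true), 1)), y_true)  # features are ignored by this strategy
baseline_mae = mean_absolute_error(y_true, baseline.predict(np.zeros((len(y_true), 1))))

print(manual_mae, sklearn_mae, baseline_mae)
```

Comparing a model's MAE with a mean-only baseline fit on the training data is one way to judge how much of the low error comes from the model and how much from the narrow range of the target.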
1.9 Conclusion
By stacking several base models, including bagged trees and a random forest, I reduced the mean absolute error on the test set to 1.352. This is a small error given that Sleep Scores typically range from 70 to 90. In terms of reverse-engineering the formula Fitbit uses to calculate Sleep Score, I arrived at several features that can be used to predict it accurately:
To add predictive power to the model, I created and transformed some of the existing variables, for example splitting the date into DAY, YEAR, and MONTH columns for later use (only MONTH proved useful) and converting the percentage variables into decimals for easy computation in Python. Once that was done, I multiplied Hours of Sleep by these decimals to get the number of hours spent in each sleep type. Lastly, I noticed that the time between sleep start and sleep end was longer than the recorded Hours of Sleep for every observation, which led me to calculate a Sleep Rate as a measure of sleep efficiency. With these features, and after removing duplicate or near-duplicate information that appeared in multiple columns, my stacking model, with Linear Regression as the final estimator and Ridge, SVR, Gradient Boosting, Bagging, Random Forest, and KNN as base models, was able to predict Sleep Score accurately.