BellaBeat - Case Study
Studying the wellness trends among FitBit users
@kaggle.elijahtoumoua_bellabeat
Studying the wellness trends among FitBit users
@kaggle.elijahtoumoua_bellabeat
I did not use R to organize and visualize the data, I used SQL and Tableau. The query can be found here and the Tableau dashboards can be found here. At the time of writing this, I do not claim any ownership to the data nor am I affiliated with BellaBeat in any way, this case study is just an opportunity for me to demonstrate my skills with data analysis.
The purpose of this capstone project is to analyze and interpret the data from Bellabeat, a product manufacturer focused on women’s health. In 2016, they released a FitBit app that tracks metrics such as how long the user has been sleeping, the estimated burned calories, etc. Urška Sršen, the Chief Creative Officer of Bellabeat, wishes to explore and interpret this data to see if there are any opportunities for growth. In this scenario, I am an analyst working for Bellabeat working to see what trends exist in this data for Srsen and how to capitalize off of it.
To set the foundation for this project, it’s important that I keep the business task in mind – gain the insights for how Bellabeat users are using the app and how we can use these trends for future marketing strategies. Keeping this in mind, the stakeholders involved in this project are…
After reviewing the business task and stakeholders at hand, I thought that the most important step before analyzing the data is to ask the right questions. With that in mind, the question I want to answer is how can BellaBeat use existing trends to improve marketing for its device?
The data I will be using, which is linked [here] (https://www.kaggle.com/datasets/arashnic/fitbit), contains FitBit data from each user. We will also be focusing on specific metrics in the set. The metadata is linked here but I will list the metrics we will be focusing on this study below:
The data is stored across 18 CSV files and we will be focusing on 3: the “Activities Merged” file, the "Weight Merged" file and the "Sleep Time Merged" file.
However, this begs the question of whether this is “good” data. Using the ROCCC method, I came to the following conclusions:
Overall, the data could be better as it is too small of a sample size to really represent the population. The users who also have consented to giving their data either were too inconsistent or there was not enough data to make a proper conclusion.
I began the process by seeing if there was any null data within the datasets.
SELECT
*
FROM
dailyActivity_merged
WHERE
Id IS NULL OR
TotalSteps IS NULL OR
ActivityDate IS NULL OR
Calories IS NULL OR
SedentaryMinutes IS NULL OR
LightActiveDistance IS NULL OR
FairlyActiveMinutes IS NULL OR
VeryActiveMinutes IS NULL;
SELECT
*
FROM
sleepDay_merged
WHERE
Id IS NULL OR
SleepDay IS NULL OR
TotalSleepRecords IS NULL OR
TotalMinutesAsleep IS NULL OR
TotalTimeInBed IS NULL;
SELECT
*
FROM
weightLogInfo_merged
WHERE
Id IS NULL OR
WHERE
Id IS NULL OR
Date IS NULL OR
WeightKg IS NULL OR
WeightPounds IS NULL OR
Fat IS NULL OR
BMI IS NULL OR
IsManualReport IS NULL OR
LogId IS NULL;
There was no null data in the Sleep Day or Activity datasets, however there was an entire column that had null data in the Weight Log Info dataset. To be specific, it was the fat column. Since we only have their weight and BMI, there isn't enough info to estimate their body fat percentage so we will be dismissing it for this analysis. After seeing that there was no null data, I wanted to see the timeframe for the data.
SELECT
MIN(ActivityDate) AS beginning_date,
MAX(ActivityDate) AS ending_date
FROM
dailyActivity_merged;
The timeframe goes from April 12, 2016 to May 12, 2016 so the timeframe is about a month. Up next, I wanted to see the amount of users within each dataset.
SELECT
COUNT(DISTINCT ID) AS total_users
FROM
dailyActivity_merged;
SELECT
COUNT(DISTINCT ID) AS total_users
FROM
sleepDay_merged;
SELECT
COUNT(DISTINCT ID) AS total_users
FROM
weightLogInfo_merged;
There were 33 users in the Activity dataset, 24 in the Sleep Day dataset, and 8 in the Weight Log Info dataset. Since there were only 8 users and an entire column filled with null data we could not estimate, I thought it would be best to remove the dataset from this analysis since I felt that it would not contribute much to the overall analysis and create more inaccuracies in my conclusions. Lastly, I wanted to see the overall times users used or manually inputted their information to get an idea for how often they were actually using the device.
SELECT
Id,
COUNT(ActivityDate) AS total_records
FROM
dailyActivity_merged
GROUP BY
Id;
SELECT
Id,
COUNT(TotalSleepRecords) AS total_sleep_counted
FROM
sleepDay_merged
GROUP BY
Id;
For the Daily Activity dataset, we can see fairly consistent usage with most users being around the 20-30 range. Only one user was noticeably low, only using it 4 times. As for the Sleep Day dataset, the results were more mixed. Less than 50% of users manually inputted their sleep over 20 times.
After analyzing the data, I thought it would be best to calculate the average trends among each user as I thought it would provide the best insights for how they are using the FitBit. To start, I calculated the average activity levels for each user. I did think it would be easier to read it in hours instead of minutes, however it returned 0 for lightly active and very active levels, so I opted to use minutes.
SELECT
AVG(act.VeryActiveMinutes) AS avg_very_active_minutes,
AVG(act.FairlyActiveMinutes) AS avg_moderate_minutes,
AVG(act.LightlyActiveMinutes) AS avg_light_minutes,
AVG(act.SedentaryMinutes) AS avg_sedentary_minutes,
AVG(act.SedentaryMinutes + act.LightlyActiveMinutes + act.FairlyActiveMinutes + act.VeryActiveMinutes) AS avg_total_minutes
FROM
dailyActivity_merged act;
After that, I wanted to see the day-by-day trends for each user.
SELECT
ActivityDate,
AVG(VeryActiveMinutes) AS avg_very_active_minutes,
AVG(FairlyActiveMinutes) AS avg_moderate_minutes,
AVG(LightlyActiveMinutes) AS avg_light_minutes,
AVG(SedentaryMinutes) AS avg_sedentary_minutes
FROM
dailyActivity_merged
GROUP BY
ActivityDate;
After creating a dataset for the day-by-day trends, I wanted to see if there was any correlation between how active an individual was and their quality of sleep. To do this, I joined the Sleep Day dataset and Activity dataset together. I did try including the sleep records in here to get an idea for which users were consistent, but there was an error in the calculations. The output was much larger than the initial records, so I made a separate table that I could use with tableau later.
SELECT
act.Id,
AVG(act.TotalSteps) AS avg_total_steps,
AVG(act.VeryActiveMinutes) AS avg_very_active_minutes,
AVG(act.FairlyActiveMinutes) AS avg_moderate_minutes,
AVG(act.LightlyActiveMinutes) AS avg_light_minutes,
AVG(act.SedentaryMinutes) AS avg_sedentary_minutes,
AVG(act.SedentaryMinutes + act.LightlyActiveMinutes + act.FairlyActiveMinutes + act.VeryActiveMinutes) AS avg_total_minutes,
AVG(act.Calories) AS avg_calories_burned,
AVG(sleep.TotalMinutesAsleep) AS avg_sleep_time,
AVG(sleep.TotalTimeInBed) AS avg_time_in_bed
FROM
sleepDay_merged sleep
JOIN dailyActivity_merged act
ON sleep.Id = act.Id
GROUP BY
act.Id;
Lastly, I wanted to see how trends were for each weekday. To do this, I created two temporary tables to join together later. I started with the Activity dataset.
SELECT
DATENAME(Weekday,ActivityDate) AS Weekday,
AVG(TotalSteps) AS avg_total_steps,
AVG(VeryActiveMinutes) AS avg_very_active_minutes,
AVG(FairlyActiveMinutes) AS avg_moderate_minutes,
AVG(LightlyActiveMinutes) AS avg_light_minutes,
AVG(SedentaryMinutes) AS avg_sedentary_minutes,
AVG(Calories) AS avg_calories_burned
INTO
Avg_Activity
FROM
dailyActivity_merged
GROUP BY
DATENAME(Weekday,ActivityDate);
After the Activity dataset, I made another temporary table with the Sleep Day dataset.
SELECT
DATENAME(Weekday, SleepDay) AS Weekday,
AVG(TotalMinutesAsleep) AS avg_minutes_asleep,
AVG(TotalTimeInBed) AS avg_minutes_in_bed
INTO
Avg_Sleeptime
FROM
sleepDay_merged
GROUP BY
DATENAME(Weekday,SleepDay);
Lastly, after creating these two temporary tables, I merged them together.
SELECT
a.Weekday,
a.avg_total_steps,
a.avg_very_active_minutes,
a.avg_moderate_minutes,
a.avg_light_minutes,
a.avg_sedentary_minutes,
a.avg_calories_burned,
b.avg_minutes_asleep,
b.avg_minutes_in_bed
FROM
Avg_Activity a
JOIN Avg_Sleeptime b
ON a.Weekday = b.Weekday
ORDER BY
Weekday DESC
DROP TABLE Avg_Activity
DROP TABLE Avg_Sleeptime;
The datasets for each of these tables can be found below. After cleaning and organizing the data, it was time to create visualizations for further analysis.
After analyzing the data, it became apparent that a majority of users spent most of their time being sedentary. Over 80% of users were sedentary on average with a little over 3% being moderately active or very active. Also, usage began to drop towards the end of the month as there is a steady decline in sedentary activity and light activity. I was concerned reading this since the American Heart Association recommends 150 minutes of moderate activity, 75 minutes of intensive activity or a combination of both. (Can be found here) What this meant was that, in order to reach this, users must spend at least 21.4 minutes being moderately active or 10.7 minutes being very active. I inputted the following code to see if they were following these recommendations:
SELECT
Id,
AVG(FairlyActiveMinutes) AS avg_moderate_active_minutes,
AVG(VeryActiveMinutes) AS avg_very_active_minutes
FROM
dailyActivity_merged
WHERE
FairlyActiveMinutes >= 21.4 OR
VeryActiveMinutes >= 10.7
GROUP BY
Id;
The output was that 30 users were active throughout the week and following these recommendations.
The graphs above show scatterplots between sleep times and activity levels amongst users. Although the correlation begins to decrease as the activity levels increase, we can see there is a negative correlation between minutes spent being sedentary and average time spent sleeping. The more sedentary a user was, the more likely they were to get a lower quality in sleep. The Center for Disease Control and Prevention also recommends that adults receive at least 7 hours of sleep, which is about 420 minutes. (Can be found here) To explain potential outliers that got less than the recommended amount of sleep, I used their sleep records to color code data points and see which users were consistent with reporting their sleeping records. Those who reported consistently did have higher sleep times at higher activity levels while those who did not report as consistently had lower than average sleep times, which may have had an effect on the overall trends since the few times they recorded their sleep times were low.
The graphs above show scatterplots between calories burnt and activity levels. According to the US Department of Agriculture, the recommended amount of calories adult women should burn is at least 1600 - 2200 calories. (Can be found here) Comparing the graphs, there is a positive correlation between how active a person is and the amount of calories they burn. Users with low sedentary times and high active times burned more calories than their counterparts.
Above are scatterplots between the average daily steps and sleep time, time in bed, and calories burnt. Studies do vary, but a 2011 study suggests that adults should walk at least 7000 steps daily. (Can be found here To further explain outliers in this portion, I color-coded the datapoints with how sentient the user was. Users with low sentient times appear as green while users with high sentient times appear as red. A majority of the users with low sentient times did sleep above the recommended sleep times and burned above the recommended amount of calories.
After analyzing and interpreting the data, I came to the following conclusions:
Seeing the correlation between the datasets, I recommend the following to BellaBeat:
CREATE TABLE bellabeatind (
"id" BIGINT,
"avg_total_steps" BIGINT,
"avg_very_active_minutes" BIGINT,
"avg_moderate_minutes" BIGINT,
"avg_light_minutes" BIGINT,
"avg_sedentary_minutes" BIGINT,
"avg_calories_burned" BIGINT,
"avg_sleep_time" BIGINT,
"avg_time_in_bed" BIGINT
);
CREATE TABLE bellabeatweekday (
"weekday" VARCHAR,
"avg_total_steps" BIGINT,
"avg_very_active_minutes" BIGINT,
"avg_moderate_minutes" BIGINT,
"avg_light_minutes" BIGINT,
"avg_sedentary_minutes" BIGINT,
"avg_calories_burned" BIGINT,
"avg_minutes_asleep" BIGINT,
"avg_minutes_in_bed" BIGINT
);
CREATE TABLE datetimebellabeat (
"activitydate" TIMESTAMP,
"avg_very_active_minutes" BIGINT,
"avg_moderate_minutes" BIGINT,
"avg_light_minutes" BIGINT,
"avg_sedentary_minutes" BIGINT
);
Anyone who has the link will be able to view this.