Baselight

BellaBeat - Case Study

Studying the wellness trends among FitBit users

@kaggle.elijahtoumoua_bellabeat

Loading...
Loading...

About this Dataset

BellaBeat - Case Study

Disclaimers

I did not use R to organize and visualize the data, I used SQL and Tableau. The query can be found here and the Tableau dashboards can be found here. At the time of writing this, I do not claim any ownership to the data nor am I affiliated with BellaBeat in any way, this case study is just an opportunity for me to demonstrate my skills with data analysis.

Prelude

The purpose of this capstone project is to analyze and interpret the data from Bellabeat, a product manufacturer focused on women’s health. In 2016, they released a FitBit app that tracks metrics such as how long the user has been sleeping, the estimated burned calories, etc. Urška Sršen, the Chief Creative Officer of Bellabeat, wishes to explore and interpret this data to see if there are any opportunities for growth. In this scenario, I am an analyst working for Bellabeat working to see what trends exist in this data for Srsen and how to capitalize off of it.

Stage I - Ask

Stakeholders

To set the foundation for this project, it’s important that I keep the business task in mind – gain the insights for how Bellabeat users are using the app and how we can use these trends for future marketing strategies. Keeping this in mind, the stakeholders involved in this project are…

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
  • Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
  • The Bellabeat Analytics Team: A team of data analysts assigned to interpret and analyze the data for Bellabeat

Business Task

After reviewing the business task and stakeholders at hand, I thought that the most important step before analyzing the data is to ask the right questions. With that in mind, the question I want to answer is how can BellaBeat use existing trends to improve marketing for its device?

Stage II - Prepare

Information on Data

The data I will be using, which is linked [here] (https://www.kaggle.com/datasets/arashnic/fitbit), contains FitBit data from each user. We will also be focusing on specific metrics in the set. The metadata is linked here but I will list the metrics we will be focusing on this study below:

  • StepTotal: The total number of steps a user has taken in a day.
  • VeryActiveMinutes: Total minutes spent in very active activity.
  • FairlyActiveMinutes: Total minutes spent in fairly active activity.
  • LightActiveMinutes: Total minutes spent in light active activity.
  • SedentaryMinutes: Total minutes spent in sedentary activity.
  • Calories: Estimated calories burnt.
  • TotalSleepTime: Estimated time spent sleeping.
  • TotalTimeInBed: Total time in bed.

The data is stored across 18 CSV files and we will be focusing on 3: the “Activities Merged” file, the "Weight Merged" file and the "Sleep Time Merged" file.

Reliability of Data and Limitations

However, this begs the question of whether this is “good” data. Using the ROCCC method, I came to the following conclusions:

  • Reliable: There are only 33 subjects in the activity file so this has low reliability. The weight and sleep file only had 8 and 28 users respectively, more so they did not log the amount of times they have slept frequently with less than 50% of users logging their sleep times consistently.
  • Original: The data has low originality as it comes from a third-party provider.
  • Comprehensive: As mentioned prior, some users were not consistent with logging their information. More importantly, if one were to look at the weight file, they would notice that most of the rows in the Fat column are null. We could get an estimate if we had more information such as their height or age but we cannot build a proper profile with this information. Lastly, the sample size is too small to be representative of the total population so it has a low comprehensive level.
  • Current: As of writing this, the data comes from 6 years ago, so users may be more/less active right now than they were at the time.
  • Cited: As mentioned prior, the data was collected from a third party source, so unknown.

Overall, the data could be better as it is too small of a sample size to really represent the population. The users who also have consented to giving their data either were too inconsistent or there was not enough data to make a proper conclusion.

Stage III - Process

Examining the Data

I began the process by seeing if there was any null data within the datasets.

SELECT
	*
FROM
	dailyActivity_merged
WHERE
	Id IS NULL OR
	TotalSteps IS NULL OR
	ActivityDate IS NULL OR
	Calories IS NULL OR
	SedentaryMinutes IS NULL OR
	LightActiveDistance IS NULL OR
	FairlyActiveMinutes IS NULL OR
	VeryActiveMinutes IS NULL;

SELECT
	*
FROM 
	sleepDay_merged
WHERE
	Id IS NULL OR
	SleepDay IS NULL OR
	TotalSleepRecords IS NULL OR
	TotalMinutesAsleep IS NULL OR
	TotalTimeInBed IS NULL;

SELECT
	* 
FROM 
	weightLogInfo_merged
WHERE
	Id IS NULL OR
	WHERE
	Id IS NULL OR
	Date IS NULL OR
	WeightKg IS NULL OR
	WeightPounds IS NULL OR
	Fat IS NULL OR
	BMI IS NULL OR
	IsManualReport IS NULL OR
	LogId IS NULL;

There was no null data in the Sleep Day or Activity datasets, however there was an entire column that had null data in the Weight Log Info dataset. To be specific, it was the fat column. Since we only have their weight and BMI, there isn't enough info to estimate their body fat percentage so we will be dismissing it for this analysis. After seeing that there was no null data, I wanted to see the timeframe for the data.

SELECT
	MIN(ActivityDate) AS beginning_date,
	MAX(ActivityDate) AS ending_date
FROM
	dailyActivity_merged;

The timeframe goes from April 12, 2016 to May 12, 2016 so the timeframe is about a month. Up next, I wanted to see the amount of users within each dataset.

SELECT
	COUNT(DISTINCT ID) AS total_users
FROM
	dailyActivity_merged;

SELECT
	COUNT(DISTINCT ID) AS total_users
FROM
	sleepDay_merged;

SELECT
	COUNT(DISTINCT ID) AS total_users
FROM
	weightLogInfo_merged;

There were 33 users in the Activity dataset, 24 in the Sleep Day dataset, and 8 in the Weight Log Info dataset. Since there were only 8 users and an entire column filled with null data we could not estimate, I thought it would be best to remove the dataset from this analysis since I felt that it would not contribute much to the overall analysis and create more inaccuracies in my conclusions. Lastly, I wanted to see the overall times users used or manually inputted their information to get an idea for how often they were actually using the device.

SELECT
	Id,
	COUNT(ActivityDate) AS total_records
FROM	
	dailyActivity_merged
GROUP BY
	Id;

SELECT
	Id,
	COUNT(TotalSleepRecords) AS total_sleep_counted
FROM
	sleepDay_merged
GROUP BY
	Id;

For the Daily Activity dataset, we can see fairly consistent usage with most users being around the 20-30 range. Only one user was noticeably low, only using it 4 times. As for the Sleep Day dataset, the results were more mixed. Less than 50% of users manually inputted their sleep over 20 times.

Organizing the Data

After analyzing the data, I thought it would be best to calculate the average trends among each user as I thought it would provide the best insights for how they are using the FitBit. To start, I calculated the average activity levels for each user. I did think it would be easier to read it in hours instead of minutes, however it returned 0 for lightly active and very active levels, so I opted to use minutes.

SELECT
	AVG(act.VeryActiveMinutes) AS avg_very_active_minutes,
	AVG(act.FairlyActiveMinutes) AS avg_moderate_minutes,
	AVG(act.LightlyActiveMinutes) AS avg_light_minutes,
	AVG(act.SedentaryMinutes) AS avg_sedentary_minutes,
	AVG(act.SedentaryMinutes + act.LightlyActiveMinutes + act.FairlyActiveMinutes + act.VeryActiveMinutes) AS avg_total_minutes
FROM 
	dailyActivity_merged act;

After that, I wanted to see the day-by-day trends for each user.

SELECT
	ActivityDate,
	AVG(VeryActiveMinutes) AS avg_very_active_minutes,
	AVG(FairlyActiveMinutes) AS avg_moderate_minutes,
	AVG(LightlyActiveMinutes) AS avg_light_minutes,
	AVG(SedentaryMinutes) AS avg_sedentary_minutes
FROM
	dailyActivity_merged
GROUP BY
	ActivityDate;

After creating a dataset for the day-by-day trends, I wanted to see if there was any correlation between how active an individual was and their quality of sleep. To do this, I joined the Sleep Day dataset and Activity dataset together. I did try including the sleep records in here to get an idea for which users were consistent, but there was an error in the calculations. The output was much larger than the initial records, so I made a separate table that I could use with tableau later.

SELECT
	act.Id,
	AVG(act.TotalSteps) AS avg_total_steps,
	AVG(act.VeryActiveMinutes) AS avg_very_active_minutes,
	AVG(act.FairlyActiveMinutes) AS avg_moderate_minutes,
	AVG(act.LightlyActiveMinutes) AS avg_light_minutes,
	AVG(act.SedentaryMinutes) AS avg_sedentary_minutes,
	AVG(act.SedentaryMinutes + act.LightlyActiveMinutes + act.FairlyActiveMinutes + act.VeryActiveMinutes) AS avg_total_minutes,
	AVG(act.Calories) AS avg_calories_burned,
	AVG(sleep.TotalMinutesAsleep) AS avg_sleep_time,
	AVG(sleep.TotalTimeInBed) AS avg_time_in_bed
FROM
	sleepDay_merged sleep
	JOIN dailyActivity_merged act
	ON sleep.Id = act.Id
GROUP BY 
	act.Id;

Lastly, I wanted to see how trends were for each weekday. To do this, I created two temporary tables to join together later. I started with the Activity dataset.

SELECT 
	DATENAME(Weekday,ActivityDate) AS Weekday,
	AVG(TotalSteps) AS avg_total_steps,
	AVG(VeryActiveMinutes) AS avg_very_active_minutes,
	AVG(FairlyActiveMinutes) AS avg_moderate_minutes,
	AVG(LightlyActiveMinutes) AS avg_light_minutes,
	AVG(SedentaryMinutes) AS avg_sedentary_minutes,
	AVG(Calories) AS avg_calories_burned
INTO
	Avg_Activity
FROM
	dailyActivity_merged
GROUP BY
	DATENAME(Weekday,ActivityDate);

After the Activity dataset, I made another temporary table with the Sleep Day dataset.

SELECT
	DATENAME(Weekday, SleepDay) AS Weekday,
	AVG(TotalMinutesAsleep) AS avg_minutes_asleep,
	AVG(TotalTimeInBed) AS avg_minutes_in_bed
INTO
	Avg_Sleeptime
FROM 
	sleepDay_merged
GROUP BY
	DATENAME(Weekday,SleepDay);

Lastly, after creating these two temporary tables, I merged them together.

SELECT
	a.Weekday,
	a.avg_total_steps,
	a.avg_very_active_minutes,
	a.avg_moderate_minutes,
	a.avg_light_minutes,
	a.avg_sedentary_minutes,
	a.avg_calories_burned,
	b.avg_minutes_asleep,
	b.avg_minutes_in_bed
FROM
	Avg_Activity a
	JOIN Avg_Sleeptime b
	ON a.Weekday = b.Weekday
ORDER BY
	Weekday DESC

DROP TABLE Avg_Activity
DROP TABLE Avg_Sleeptime;

The datasets for each of these tables can be found below. After cleaning and organizing the data, it was time to create visualizations for further analysis.

Step IV - Analysis

Trends Over Time


After analyzing the data, it became apparent that a majority of users spent most of their time being sedentary. Over 80% of users were sedentary on average with a little over 3% being moderately active or very active. Also, usage began to drop towards the end of the month as there is a steady decline in sedentary activity and light activity. I was concerned reading this since the American Heart Association recommends 150 minutes of moderate activity, 75 minutes of intensive activity or a combination of both. (Can be found here) What this meant was that, in order to reach this, users must spend at least 21.4 minutes being moderately active or 10.7 minutes being very active. I inputted the following code to see if they were following these recommendations:

SELECT
	Id,
	AVG(FairlyActiveMinutes) AS avg_moderate_active_minutes,
	AVG(VeryActiveMinutes) AS avg_very_active_minutes
FROM
	dailyActivity_merged
WHERE
	FairlyActiveMinutes >= 21.4 OR
	VeryActiveMinutes >= 10.7
GROUP BY 
	Id;

The output was that 30 users were active throughout the week and following these recommendations.

Trends with Sleeping


The graphs above show scatterplots between sleep times and activity levels amongst users. Although the correlation begins to decrease as the activity levels increase, we can see there is a negative correlation between minutes spent being sedentary and average time spent sleeping. The more sedentary a user was, the more likely they were to get a lower quality in sleep. The Center for Disease Control and Prevention also recommends that adults receive at least 7 hours of sleep, which is about 420 minutes. (Can be found here) To explain potential outliers that got less than the recommended amount of sleep, I used their sleep records to color code data points and see which users were consistent with reporting their sleeping records. Those who reported consistently did have higher sleep times at higher activity levels while those who did not report as consistently had lower than average sleep times, which may have had an effect on the overall trends since the few times they recorded their sleep times were low.

Calories Burned by Activity


The graphs above show scatterplots between calories burnt and activity levels. According to the US Department of Agriculture, the recommended amount of calories adult women should burn is at least 1600 - 2200 calories. (Can be found here) Comparing the graphs, there is a positive correlation between how active a person is and the amount of calories they burn. Users with low sedentary times and high active times burned more calories than their counterparts.

Correlation Between Total Steps


Above are scatterplots between the average daily steps and sleep time, time in bed, and calories burnt. Studies do vary, but a 2011 study suggests that adults should walk at least 7000 steps daily. (Can be found here To further explain outliers in this portion, I color-coded the datapoints with how sentient the user was. Users with low sentient times appear as green while users with high sentient times appear as red. A majority of the users with low sentient times did sleep above the recommended sleep times and burned above the recommended amount of calories.

Step V - Act

Concluding Thoughts

After analyzing and interpreting the data, I came to the following conclusions:

  • There is a positive correlation between how active a user is and how many calories they burn
  • There is a negative correlation between how sedentary a user is and the quality of their sleep
  • The more steps a user takes, the more calories they burn

Recommendations

Seeing the correlation between the datasets, I recommend the following to BellaBeat:

  • Obtain more data to create a more accurate analysis. This could be done by doing a controlled study between different users and seeing how they utilize the app.
  • Implement a coaching program to help users who are new to fitness. A lot of them may be sedentary for long periods of time since they may not have experience or knowledge in health or wellness trends. This coaching program can encourage users to be more active and create a better lifestlye for themselves.
  • Have an alarm within the BellaBeat app that alerts users if they are sedentary for too long. This will encourage users to be more active and take more daily steps.
  • Encourage users to be consistent with recording the amount of times they sleep by implementing a reward system. They could receive points that can go towards BellaBeat products and services.
  • Include more information in the scenario there is null data. This could be done by including the users age, height, and other information.

Tables

Bellabeatind

@kaggle.elijahtoumoua_bellabeat.bellabeatind
  • 8.32 KB
  • 24 rows
  • 9 columns
Loading...

CREATE TABLE bellabeatind (
  "id" BIGINT,
  "avg_total_steps" BIGINT,
  "avg_very_active_minutes" BIGINT,
  "avg_moderate_minutes" BIGINT,
  "avg_light_minutes" BIGINT,
  "avg_sedentary_minutes" BIGINT,
  "avg_calories_burned" BIGINT,
  "avg_sleep_time" BIGINT,
  "avg_time_in_bed" BIGINT
);

Bellabeatweekday

@kaggle.elijahtoumoua_bellabeat.bellabeatweekday
  • 7.6 KB
  • 7 rows
  • 9 columns
Loading...

CREATE TABLE bellabeatweekday (
  "weekday" VARCHAR,
  "avg_total_steps" BIGINT,
  "avg_very_active_minutes" BIGINT,
  "avg_moderate_minutes" BIGINT,
  "avg_light_minutes" BIGINT,
  "avg_sedentary_minutes" BIGINT,
  "avg_calories_burned" BIGINT,
  "avg_minutes_asleep" BIGINT,
  "avg_minutes_in_bed" BIGINT
);

Datetimebellabeat

@kaggle.elijahtoumoua_bellabeat.datetimebellabeat
  • 5.21 KB
  • 31 rows
  • 5 columns
Loading...

CREATE TABLE datetimebellabeat (
  "activitydate" TIMESTAMP,
  "avg_very_active_minutes" BIGINT,
  "avg_moderate_minutes" BIGINT,
  "avg_light_minutes" BIGINT,
  "avg_sedentary_minutes" BIGINT
);

Share link

Anyone who has the link will be able to view this.