https://github.com/progdrummer1/bellabeat-a-research-into-fitness-metrics-using-r

Study Case: BellaBeat, Data-analysis in R
https://github.com/progdrummer1/bellabeat-a-research-into-fitness-metrics-using-r

Host: GitHub
URL: https://github.com/progdrummer1/bellabeat-a-research-into-fitness-metrics-using-r
Owner: Progdrummer1
Created: 2024-12-19T14:40:06.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2024-12-20T13:37:06.000Z (about 1 month ago)
Last Synced: 2024-12-20T14:37:57.789Z (about 1 month ago)
Language: R
Size: 87.9 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: Readme.md

# BellaBeat: A research into fitness metrics using R

![image](https://github.com/user-attachments/assets/dc8f405c-f882-44e4-96ff-42b2d68641a6)

**File summary**
[00 Source Datasets.md](https://github.com/Progdrummer1/BellaBeat-Using-R-to-research-relations-in-fitness-metrics./blob/15992ef0c4c4c3a9b4b4e5fc750e75b2f183a794/00%20Source%20Datasets.md)
[01 Data Exploration and Cleaning Daily Activity Dataset.r](https://github.com/Progdrummer1/BellaBeat-Using-R-to-research-relations-in-fitness-metrics./blob/15992ef0c4c4c3a9b4b4e5fc750e75b2f183a794/01%20Data%20Exploration%20and%20Cleaning%20Daily%20Activity%20Dataset.r)
[02 Data Exploration and Cleaning Sleep Dataset.r](https://github.com/Progdrummer1/BellaBeat-Using-R-to-research-relations-in-fitness-metrics./blob/15992ef0c4c4c3a9b4b4e5fc750e75b2f183a794/02%20Data%20Exploration%20and%20Cleaning%20Sleep%20Dataset)
[03 Data Analysis Daily Activity Dataset.r](https://github.com/Progdrummer1/BellaBeat-Using-R-to-research-relations-in-fitness-metrics./blob/15992ef0c4c4c3a9b4b4e5fc750e75b2f183a794/03%20Data%20Analysis%20Daily%20Activity%20Dataset.r)
[04 Merging Datasets.r](https://github.com/Progdrummer1/BellaBeat-Using-R-to-research-relations-in-fitness-metrics./blob/15992ef0c4c4c3a9b4b4e5fc750e75b2f183a794/04%20Merging%20Datasets.R)
[05 Data Analysis Sleep and Daily Activity.r](https://github.com/Progdrummer1/BellaBeat-Using-R-to-research-relations-in-fitness-metrics./blob/15992ef0c4c4c3a9b4b4e5fc750e75b2f183a794/05%20Data%20Analysis%20Sleep%20and%20Daily%20Activity)
[06 Changelog.md](https://github.com/Progdrummer1/BellaBeat-Using-R-to-research-relations-in-fitness-metrics./blob/15992ef0c4c4c3a9b4b4e5fc750e75b2f183a794/Case_Study_2_%20Changelog.md)

# Introduction
In this case study, BellaBeat has hired me as a junior data analyst. BellaBeat is a high-tech manufacturer of health-focused products for women. They sell several smart devices that track activity, sleep, stress, and reproductive health. My task is to identify trends in fitness data gathered by Fitbit, so that BellaBeat can gather valuable information to improve its marketing strategy.

To answer this question, the following steps will be undertaken:

* **Ask:** What is the business question that has to be addressed and how to get to the answer?
* **Prepare:** How to gather good quality data to answer this question?
* **Process:** Clean and prepare data for analysis.
* **Analyze:** Analyze the data to get meaningful results, relating to the question to be answered.
* **Share:** Make compelling visuals in line with what you want to convey to your stakeholders.
* **Act:** Give recommendations to your stakeholders.

# Ask

As stated before, in this report trends in Fitbit data have been identified so BellaBeat can improve their marketing campaign. More broadly, the following questions are addressed in this report:

1\. What are some trends in smart device usage?
2\. How could these trends apply to BellaBeat customers?
3\. How could these trends help influence BellaBeat marketing strategy?

I will answer these questions by cleaning and analyzing data from 30 Fitbit users. By identifying trends in this data and identifying how they apply to BellaBeat users I can inform BellaBeat how to apply these trends to their marketing strategy.

This report is made for, and will be presented to, two stakeholders. The first is Urška Sršen, BellaBeat’s co-founder and Chief Creative Officer. The second is Sando Mur, a mathematician and BellaBeat’s co-founder, and therefore a key member of the BellaBeat executive team.

# Prepare

**Data Source**
This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 12 March 2016 and 12 May 2016. The data in the dataset used here runs from 12 March 00:00 until 13 May 08:00. The data is stored locally in the folder "...\\Google Course Data\\Capstone\_Project\_Bellabeat\\Data".
An extensive description of the data source can be found here: [00 Source Datasets.md](https://github.com/Progdrummer1/BellaBeat-Using-R-to-research-relations-in-fitness-metrics./blob/15992ef0c4c4c3a9b4b4e5fc750e75b2f183a794/00%20Source%20Datasets.md)

**Data Validity**
There were only 30 participants in this dataset. This is quite a small sample, so no strong conclusions can be drawn from it. The respondents may also not be representative of the average personal tracker user, since mainly young and internet-savvy people take part in this type of online survey. This is something to keep in mind, because it could bias the analysis. Different Fitbit trackers were used to gather this data, which is important to note, since some of the trends found could be due to differences between trackers. The data comes from 2016, so it is 8 years old and could be outdated. The data was gathered in only two consecutive months, between 12 March 2016 and 12 May 2016. The data comes directly from the users’ Fitbits, so the data is original.

**Daily Activity Dataset**
The data is organized in a long format. The data description states that there are 30 respondents, while some of the tables contain 34 distinct Ids. It is not clear where this discrepancy comes from. This dataset is quite comprehensive, since it contains: Id, Activity Date, Total Steps, Total Distance, Tracker Distance, Logged Activities Distance, Very Active Distance, Moderately Active Distance, Light Active Distance, Sedentary Active Distance, Very Active Minutes, Fairly Active Minutes, Lightly Active Minutes, Sedentary Minutes, Calories. The maximum and minimum values of each column seem reasonable, apart from some 0 values that need to be evaluated.

```r
#Checking max values per column
dActivity_both_max_values <- apply(dActivity_both, 2, max)
print(dActivity_both_max_values)

#Id ActivityDate TotalSteps
#"8877689391" "5/9/2016" "36019"
#TotalDistance TrackerDistance LoggedActivitiesDistance
#"28.03" "28.03" "4.942142"
#VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
#"21.92" "6.48" "10.71"
#SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
#"0.11" "210" "143"
#LightlyActiveMinutes SedentaryMinutes Calories
#"518" "1440" "4900"

#these all seem reasonable results within the realistic limit of each variable.

#Checking min values per column
dActivity_both_min_values <- apply(dActivity_both, 2, min)
print(dActivity_both_min_values)
# Id ActivityDate TotalSteps
# "1503960366" "2016-03-12" " 0"
# TotalDistance TrackerDistance LoggedActivitiesDistance
# " 0.00" " 0.00" "0.000000"
# VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
# " 0.00" "0.00" " 0.00"
# SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
# "0.00" " 0" " 0"
# LightlyActiveMinutes SedentaryMinutes Calories
# " 0" " 0" " 0"
#some 0 values need to be evaluated, since no one burns 0 calories per day.
```
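Because `apply()` first converts the data frame to a matrix, a single character column such as `ActivityDate` turns every value into text, which is why the results above are printed in quotes. A minimal sketch of the same range check restricted to numeric columns follows. The data frame `toy_activity` and its values are invented for illustration and are not taken from the Fitbit data.

```r
# Toy data standing in for dActivity_both (values are invented for illustration)
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(0L, 9500L, 12000L),
  Calories = c(0L, 1800L, 2300L),
  stringsAsFactors = FALSE
)

# Keep only numeric columns so min/max are computed on numbers, not strings
numeric_cols <- toy_activity[sapply(toy_activity, is.numeric)]

# One row per statistic, one column per variable
range_summary <- sapply(numeric_cols, range)
rownames(range_summary) <- c("min", "max")
print(range_summary)
```

Date columns can be checked separately once they have been converted to the `Date` type, so that they are compared as dates rather than as strings.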

**Sleep Dataset**
This dataset contains the following variables: "Id", "SleepDay", "TotalSleepRecords", "TotalMinutesAsleep", "TotalTimeInBed". The maximum and minimum values of each column seem reasonable.

```r
#checking max values
sleepDay_max_values <- apply(sleepDay_no_duplicates, 2, max)
print(sleepDay_max_values)
# Id day TotalSleepRecords TotalMinutesAsleep
# "8792009665" "5/9/2016 12:00:00 AM" "3" "796"
# TotalTimeInBed
# "961"

#these all seem reasonable results within the realistic limit of each variable, although 16 hours spent in bed is a bit long, maybe someone was sick.

#checking min values
sleepDay_min_values <- apply(sleepDay_no_duplicates, 2, min)
print(sleepDay_min_values)
# Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
# "1503960366" "4/12/2016 12:00:00 AM" "1" " 58" " 61"
#61 minutes in bed is short, but possible as the minimum.
```

# Process

**Changelog**
All the changes made to the data have been captured in this changelog: [Case_Study_2_ Changelog.md](https://github.com/Progdrummer1/BellaBeat-Using-R-to-research-relations-in-fitness-metrics./blob/15992ef0c4c4c3a9b4b4e5fc750e75b2f183a794/Case_Study_2_%20Changelog.md)

**Activity Data**
Using R, the 'dailyActivity_merged.csv' containing data from March/April and 'dailyActivity_merged2.csv' containing data from April/May have been merged.

```r
#import Daily activity data sets
dActivity<- read.csv(".../Capstone_Project_Bellabeat/Data/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/dailyActivity_merged.csv")
dActivity2 <- read.csv(".../Capstone_Project_Bellabeat/Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

#Merging the tables from March/April and April/May together
dActivity_both <- rbind(dActivity, dActivity2)
```

Some days with only 0 values were found in the Daily Activity dataset and were removed. These were probably days when the device wasn’t used.

```r
#Checking for reliability
View(dActivity_both)
#Turns out sometimes there are days where there was no input and this has been interpreted as 0 on every variable.
```
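The removal of these all-zero days is not shown in the snippet above. One way it could be done is sketched below on invented data. The choice of columns in `metric_cols` and the name `toy_activity` are assumptions made for the illustration, not the author's script.

```r
# Toy data: row 1 has no input at all, row 3 has no steps but does have calories
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  TotalSteps = c(0L, 9500L, 0L),
  TotalDistance = c(0, 6.8, 0),
  Calories = c(0L, 1800L, 1500L)
)

# Columns that should not all be zero on a day the tracker was worn (assumed selection)
metric_cols <- c("TotalSteps", "TotalDistance", "Calories")

# A row is "all zero" when none of its metrics differs from 0
all_zero <- rowSums(toy_activity[metric_cols] != 0) == 0

toy_clean <- toy_activity[!all_zero, ]
print(toy_clean)
```

Requiring every metric to be zero keeps days such as row 3, where the tracker recorded calories but no steps, so those can still be reviewed separately.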

Some duplicates were found on the last day of the first dataset and the first day of the second dataset. The records from the last day of the first dataset were the ones removed, since the dataset showing the steps taken per hour shows that this day starts at 12 AM, so the whole day is registered.

```r
#checking for any duplicates with id and ActivityDate columns combinations.
sum(duplicated(dActivity_both[, c("ActivityDate", "Id")]))
#24 before removing them!!
#these duplicates sneaked in here because it is the end of the first data set and the start of the second!

#Removing overlapping date
dActivity <- dActivity[dActivity$ActivityDate != "4/12/2016", ]
#Merging the tables from March/April and April/May together
dActivity_both <- rbind(dActivity, dActivity2)
```
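Before dropping one side of an overlap, it can help to look at both copies of each duplicated key next to each other. The sketch below does this on two small invented exports. The names `first_export`, `second_export` and all values are hypothetical.

```r
# Two toy exports that overlap on 4/12/2016 (values are invented)
first_export <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/11/2016", "4/12/2016", "4/12/2016"),
  TotalSteps = c(8000L, 300L, 450L)
)
second_export <- data.frame(
  Id = c(1, 2),
  ActivityDate = c("4/12/2016", "4/12/2016"),
  TotalSteps = c(9100L, 7600L)
)

combined <- rbind(first_export, second_export)
key_cols <- c("Id", "ActivityDate")

# Flag every row whose key occurs more than once, including the first occurrence
is_dup <- duplicated(combined[key_cols]) |
  duplicated(combined[key_cols], fromLast = TRUE)
print(combined[is_dup, ])

# Drop the overlapping date from the first export only, then combine again
overlap_date <- "4/12/2016"
first_trimmed <- first_export[first_export$ActivityDate != overlap_date, ]
combined_clean <- rbind(first_trimmed, second_export)

# Number of duplicated keys left after trimming
remaining_dups <- sum(duplicated(combined_clean[key_cols]))
print(remaining_dups)
```

Using `fromLast = TRUE` in addition to the default direction shows both records of each pair, which makes it easier to see which export holds the complete day.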

“ActivityDate” was incorrectly typed as “Character”, so it was converted to “Date”. The dataset has been checked for empty and NULL values, which were not present.

```r
#checking data types
dActivity_both_data_types <- sapply(dActivity_both, class)
print(dActivity_both_data_types)
#Id ActivityDate TotalSteps
#"double" "character" "integer"
#TotalDistance TrackerDistance LoggedActivitiesDistance
#"double" "double" "double"
#VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
#"double" "double" "double"
#SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
#"double" "integer" "integer"
#LightlyActiveMinutes SedentaryMinutes Calories
#"integer" "integer" "integer"
#transform ActivityDate into Date

#Changing to date data type
dActivity_both$ActivityDate <- as.Date(dActivity_both$ActivityDate, format = "%m/%d/%Y")
dActivity_both_data_types <- sapply(dActivity_both, class)
print(dActivity_both_data_types)

#Id ActivityDate TotalSteps
#"double" "Date" "integer"
#TotalDistance TrackerDistance LoggedActivitiesDistance
#"double" "double" "double"
#VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
#"double" "double" "double"
#SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
#"double" "integer" "integer"
#LightlyActiveMinutes SedentaryMinutes Calories
#"integer" "integer" "integer"
#ActivityDate is now of type Date
```

Checked if all dates have the same length.

```r
#check if all dates values have the same length.
length_check <- sapply(dActivity_both$ActivityDate, nchar)
all(length_check == length_check[1]) # TRUE
```

Checked for empty and NULL values.

```r
library(dplyr)

#any empty values?
any(is.na(dActivity_both))
#False

#any null values?
#note: is.null() tests the object as a whole, so for an existing data frame this is always FALSE
any(is.null(dActivity_both))
#false
```
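`is.null()` tests whether the object itself is `NULL`, not whether individual cells are, so missing cells in a data frame show up as `NA` or as empty strings instead. A minimal sketch of a per-column check for both, on an invented data frame `toy`:

```r
# Toy data with one NA and one empty string (values are invented)
toy <- data.frame(
  Id = c(1, 2, 3),
  ActivityDate = c("4/12/2016", "", "4/14/2016"),
  TotalSteps = c(100L, NA, 300L),
  stringsAsFactors = FALSE
)

# Number of NA cells per column
na_per_col <- colSums(is.na(toy))

# Number of empty-string cells per column (only meaningful for character columns)
empty_per_col <- sapply(toy, function(col) {
  if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L
})

print(na_per_col)
print(empty_per_col)
```

Counting per column rather than calling `any()` on the whole table shows directly where a problem sits if one is found.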

There were multiple rows with exactly "1440" sedentary minutes, which is 24 hours. This is possible in principle, but the other variables in these rows also showed implausible values, so these rows were removed.

```r
dActivity_both <- dActivity_both %>%
  filter(SedentaryMinutes != 1440)
```

**Sleep dataset**
Renamed the “SleepDay” column to “day” to avoid ambiguity with the dataset name “sleepDay”.
```r
#renaming ambiguous column name
sleepDay <- sleepDay %>%
  rename(day = SleepDay)
```

Removed 3 duplicates from the sleepDay dataset.
```r
#checking for any duplicates with id and day columns combinations.
sum(duplicated(sleepDay[, c("day", "Id")]))
#3,
#so there are three Id and day combinations that occur twice.

# Find the duplicated rows based on 'day' and 'Id'
sleep_duplicates <- sleepDay[duplicated(sleepDay[, c("day", "Id")]), ]

# Viewing the duplicated records, they all contain exactly the same values, the duplicates will be removed in this script
library(dplyr)
sleepDay_no_duplicates <- sleepDay %>%
  distinct(day, Id, .keep_all = TRUE)
```

Checked for empty and NULL values, which were not present.
```r
#any empty values?
any(is.na(sleepDay_no_duplicates))
#False

#any null values?
#note: is.null() tests the object as a whole, so for an existing data frame this is always FALSE
any(is.null(sleepDay_no_duplicates))
#False
```

Removed the times from the "day" column, to only keep the date.

```r
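#keep only the dates, not the times
#parsing with a date-only format drops the time; a fixed-width substr(x, 1, 10)
#would keep stray characters for dates such as "5/9/2016"
sleepDay_no_duplicates$day <- as.Date(sleepDay_no_duplicates$day, format = "%m/%d/%Y")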
```
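The timestamps in this export have a variable width (`4/12/2016` versus `5/9/2016`), so a fixed-width cut such as `substr(x, 1, 10)` can leave stray characters behind. A small illustration with two strings in the format shown earlier. The vector `raw_days` is made up for the example.

```r
# Two timestamps in the format of the sleep export (illustrative values)
raw_days <- c("4/12/2016 12:00:00 AM", "5/9/2016 12:00:00 AM")

# Fixed-width cut: the result depends on how many digits month and day have
cut_days <- substr(raw_days, 1, 10)
print(cut_days)

# Parsing with a date-only format ignores the trailing time part
parsed_days <- as.Date(raw_days, format = "%m/%d/%Y")
print(parsed_days)
```

Parsing to `Date` also gives the column the same type as a converted `ActivityDate`, which matters when the two tables are joined on the date.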

# Analyze/Share
**Check for normality**
It was checked whether the variables "calories" and "amount of sleep" are normally distributed. They were not, so only the non-parametric Spearman’s correlation test was used.

```r
shapiro.test(merged_data$SedentaryMinutes) #<-- #not normally distributed
shapiro.test(merged_data$Calories) #<-- #not normally distributed
```
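The decision rule behind this step can be sketched as follows, using a simulated right-skewed variable rather than the Fitbit data. The 0.05 threshold and the name `skewed` are assumptions made for the illustration.

```r
set.seed(42)

# Simulated right-skewed variable (hypothetical stand-in for a real column)
skewed <- rexp(200, rate = 1 / 300)

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
result <- shapiro.test(skewed)

# A small p-value rejects normality, so a rank-based method such as Spearman is preferred
use_spearman <- result$p.value < 0.05
print(result$p.value)
print(use_spearman)
```

`shapiro.test()` accepts between 3 and 5000 non-missing values, so very large columns would need to be sampled first. A histogram or Q-Q plot is a useful visual complement to the test.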

**Correlations between calories and other variables**
Spearman’s correlations between calories and the other variables were calculated, to see which variables are most strongly related to the amount of calories spent. One can see that the strongest correlations were with "TotalSteps", "TotalDistance" and "VeryActiveMinutes".

```r
#calculating the Spearman's correlation between the variable 'calories' and other variables.
numeric_df <- dActivity_both[sapply(dActivity_both, is.numeric)]
cor_matrix <- cor(numeric_df, method = "spearman", use = "complete.obs")
print(cor_matrix)
# rm(numeric_df)

#Testing significance of highest correlations
cor.test(dActivity_both$TotalSteps,dActivity_both$Calories, method = "spearman")
#p-value < 2.2e-16, highly significant
#cor 0.58, moderate relation.

cor.test(dActivity_both$SedentaryMinutes,dActivity_both$Calories)
#no method given, so this is Pearson's correlation (the default)
#cor -0.1278156, small negative relation
#p-value = 2.021e-06, highly significant.

cor.test(dActivity_both$TotalSteps ,dActivity_both$Calories)
#again Pearson's correlation (the default)
# p < 2.2e-16, highly significant
# cor = 0.6295828,
# Correlation between total steps and calories is highly significant, and moderately positive.
```
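`cor.test()` and `cor()` compute Pearson's correlation unless `method` is set explicitly, and the two coefficients can differ noticeably when a relation is monotonic but not linear. A minimal sketch on invented data (the `steps` and `calories` values below are made up, not taken from the dataset):

```r
# Invented data: calories rise strictly with steps, but not in a straight line
steps <- 1:20
calories <- exp(steps / 4)

# Pearson measures linear association, Spearman measures monotonic association
pearson <- cor(steps, calories, method = "pearson")
spearman <- cor(steps, calories, method = "spearman")
print(c(pearson = pearson, spearman = spearman))

# For a Spearman test the estimate is reported as "rho" instead of "cor"
test_spearman <- cor.test(steps, calories, method = "spearman")
print(names(test_spearman$estimate))
```

Because the ranks of both vectors are identical here, Spearman's coefficient is 1 while Pearson's stays below 1. Passing `method = "spearman"` on every call keeps the reported coefficients comparable with each other.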

Correlations_With_Calories

**Relation between calories and very active minutes**
Since the amount of calories is not normally distributed, only non-parametric tests were used. Spearman’s correlation test gives a correlation of 0.52 with a p-value \< 2.2e-16 (see script above). This means there is a moderate monotonic relation between calories and the number of very active minutes.

```r
library(ggplot2)
ggplot(data = merged_data, aes(x = VeryActiveMinutes, y = Calories)) + geom_point()
```

Very_Active_Calories

**Relation between calories and steps**
The correlation between calories and total steps was calculated (see script above). Spearman’s correlation test gives a correlation of 0.55 with a p-value \< 2.2e-16. This means there is a moderate monotonic relation between calories and the number of steps taken.

```r
ggplot(data = merged_data, aes(x = TotalSteps, y = Calories))+ geom_point()
```

Steps_Calories

**The daily activity data set and the sleep data set were merged**
```r
#Join dActivity_both and sleepDay_no_duplicates
library(dplyr)

# dActivity_both$Id_Date <- c(dActivity_both$Id, dActivity_both$ActivityDate)
dActivity_both$Id_Date <- paste(dActivity_both$Id, dActivity_both$ActivityDate)
sleepDay_no_duplicates$Id_Date <- paste(sleepDay_no_duplicates$Id, sleepDay_no_duplicates$day)

merged_data <- dActivity_both %>% left_join(sleepDay_no_duplicates, by = "Id_Date")
```
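A pasted key only matches when both date columns are rendered as the same text, so it is worth confirming that both are of type `Date` (or use the same string format) before joining. As an alternative sketch, the two tables can be joined directly on both key columns. Base R's `merge()` is used here so the example has no package dependencies, and the data frames `activity` and `sleep` are invented.

```r
# Toy tables with matching Date-typed keys (values are invented)
activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  Calories = c(1800L, 2100L, 2300L)
)
sleep <- data.frame(
  Id = c(1, 2),
  day = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(420L, 380L)
)

# Left join on both key columns; days without a sleep record get NA
merged <- merge(activity, sleep,
                by.x = c("Id", "ActivityDate"),
                by.y = c("Id", "day"),
                all.x = TRUE)
print(merged)

# Number of activity days without a matching sleep record
unmatched <- sum(is.na(merged$TotalMinutesAsleep))
print(unmatched)
```

Joining on the two columns directly avoids duplicated `Id` columns in the result, and counting the unmatched rows afterwards is a quick check that the keys line up as expected.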

**Relation between sleep and sedentary activity**
The correlation between the number of minutes slept and the amount of sedentary activity was also calculated. The number of minutes slept is not normally distributed, so Spearman’s correlation was used. This gave a correlation of \-0.57 with a p-value \< 2.2e-16, which means there is a moderate negative monotonic relation between the number of minutes slept and the amount of sedentary activity.
```r
cor.test(merged_data$TotalMinutesAsleep, merged_data$SedentaryMinutes, method = "spearman")
#p-value < 2.2e-16, rho = -0.5684106, significant relation!
#There is a moderate negative relationship between these two variables.
```

```r
ggplot(na.omit(merged_data), aes(x = TotalMinutesAsleep, y = SedentaryMinutes))+ geom_point()
```

![Sleep_Sedentary](https://github.com/user-attachments/assets/5e4cb281-cbfb-4449-b90c-2aa67f4869a8)

**Relation between sleep and calories**
The relation between the amount of sleep and calories was also calculated using Spearman's correlation. This relation was not significant, which is not surprising given the scatter plot.

```r
cor.test(merged_data$TotalMinutesAsleep, merged_data$Calories, method = "spearman")
#p-value = 0.4366, rho -0.03852297 not significant
```

```r
ggplot(na.omit(merged_data), aes(x = TotalMinutesAsleep, y = Calories))+ geom_point()
```
![Sleep_Calories2](https://github.com/user-attachments/assets/8b1b8ce7-1282-40ef-a6b9-e874c63931b6)

**Relation between steps taken and weight change**
It was not possible to calculate the relation between steps taken and weight change, because there are only 8 participants in the 'weight change dataset', which is too few to draw any conclusions from.

The data set description states that different Fitbit types were used in this dataset. This was not noticed during the analysis, but it could have influenced the results nonetheless.

# Act

Let’s get back to the initial research questions:
**1\. What are some trends in smart device usage?**
**2\. How could these trends apply to BellaBeat customers?**
**3\. How could these trends help influence BellaBeat marketing strategy?**

**1\. What are some trends in smart device usage?**
The following monotonic relations have been found:

* A moderate positive relation between ‘total steps’ and ‘calories’.
* A moderate positive relation between ‘very active minutes’ and ‘calories’.
* A moderate negative relation between ‘sleep’ and ‘sedentary activity’.

This data suggests that on days when one walks more or spends more ‘very active’ minutes, one tends to burn more calories. It cannot be said whether this relation is linear. It also cannot be said for certain whether it is causal. Since burning calories does not cause activity but is caused by it, one could assume a causal relation. However, there could also be a confounding variable influencing both.
The same holds true for the monotonic relation between ‘sleep’ and ‘sedentary activity’, so more research is suggested.

**2\. How could these trends apply to BellaBeat customers?**
BellaBeat users could use their device as a step counter to motivate themselves to take more steps and so burn more calories. They could also use it to motivate themselves to spend more time being very active, again to burn more calories. Finally, they could track their sleep to motivate themselves to be more active the next day.

**3\. How could these trends help influence BellaBeat marketing strategy?**
The BellaBeat team could use these findings to promote the value of tracking sleep and activity. They could highlight how activity relates to calories burnt and how more sleep is related to less sedentary activity.
Right now, BellaBeat collects data on activity, sleep, stress, and reproductive health. They could possibly add a step tracker, since the number of steps is correlated with more calories burnt.