{"id":24935185,"url":"https://github.com/anthonysanalysis/bellabeat-analysis","last_synced_at":"2026-04-20T03:31:49.387Z","repository":{"id":274555099,"uuid":"921957855","full_name":"AnthonysAnalysis/Bellabeat-Analysis","owner":"AnthonysAnalysis","description":"Bellabeat Tech Case Study Capstone Project","archived":false,"fork":false,"pushed_at":"2025-01-28T00:26:52.000Z","size":2047,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-28T15:15:12.450Z","etag":null,"topics":["analysis","capstone","case-study","data","data-analysis","data-visualization","md","r","rmd","rstudio"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AnthonysAnalysis.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-25T00:22:15.000Z","updated_at":"2025-01-28T01:02:25.000Z","dependencies_parsed_at":"2025-01-28T02:23:56.816Z","dependency_job_id":"f0aa3451-391b-434d-b4e9-5132f429551f","html_url":"https://github.com/AnthonysAnalysis/Bellabeat-Analysis","commit_stats":null,"previous_names":["anthonysanalysis/bellabeat-analysis"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnthonysAnalysis%2FBellabeat-Analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnthonysAnalysis%2FBellabeat-Analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnthonysAnalysis%2FBellabeat-Analysis/releases
","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnthonysAnalysis%2FBellabeat-Analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AnthonysAnalysis","download_url":"https://codeload.github.com/AnthonysAnalysis/Bellabeat-Analysis/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246049630,"owners_count":20715511,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","capstone","case-study","data","data-analysis","data-visualization","md","r","rmd","rstudio"],"created_at":"2025-02-02T15:21:46.747Z","updated_at":"2026-04-20T03:31:49.314Z","avatar_url":"https://github.com/AnthonysAnalysis.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Bellabeat Tech Capstone Project\n#### Author: Anthony Tran\n#### Date: 2024-12-20\n\n# **About:**\n\nBellabeat is a small, successful manufacturer of health-focused tech\nproducts for women. They are looking to expand their market share in the\nglobal smart-device market.\n\nI, as a junior data analyst on the marketing analyst team have been\ntasked with working on one of Bellabeat’s products, their app which\nconnects to their many smart devices. 
We will be analyzing usage data from\none of Bellabeat’s competitors’ devices, the Fitbit, to gain insight\ninto how customers use their smart devices, which will help inform\ndata-driven decisions for the company’s marketing and growth\nstrategy.\n\nThroughout the project, we will follow the six main steps of the data\nanalysis process:\n\n**Ask, prepare, process, analyze, share, and act.**\n\n# **Ask -**\n\nBusiness task:\n\nWe will identify usage and data trends in the Fitbit, a\nwearable from one of Bellabeat’s competitors. We will analyze the data\nto gain insight into how this smart device is used, then\nwork out how to apply these insights to our own products to drive\nBellabeat’s marketing strategy via data-driven decision making.\n\n# **Prepare -**\n\nData source: Our dataset is the Fitbit Fitness Tracker Dataset (CC0:\nPublic Domain), via Mobius.\n\nAccessibility and privacy: The data is open and licensed under\nCC0: Public Domain, under which the original owner relinquished all\ncopyright and similar rights worldwide and dedicated those rights to\nthe public domain. Others can copy, modify, distribute, and perform the\nwork, even for commercial purposes, all without asking permission.\n\nDataset information: This dataset contains data generated by ~33\nFitbit users who responded to an Amazon Mechanical Turk survey and\nconsented to the submission of their personal tracker data for the\ntimeframe 04/12/2016 - 05/12/2016. 
The quantitative\nmetrics include physical activity, heart rate, and sleep monitoring at\ndifferent levels of granularity: minute-level, hourly, and daily.\nVariation between outputs reflects the use of different types of Fitbit\ntrackers and individual tracking behaviors / preferences.\n\nData organization: The data was made available via 18 .csv files, each\ncontaining various quantitative metrics. The majority of\nthe files are in long format, with 3 files in wide format. Some datasets were\nexcluded from analysis after verification, either because they duplicated data\nalready included in a more complete data table or because the number of\nparticipants was too small to generate meaningful insight.\n\nData limitations: Unknown demographics. We do not know\ndemographics for our participants such as age or sex, which would be\nespecially important for a company that focuses on\nwomen-centric products. Also, some of the data are not clearly\nlabeled with units, so we had to assume standard units\nas a best guess, given the context.\n\nSmall sample size and time frame: A common rule of thumb in statistics\nis that an analysis needs a sample size of at least 30 to support\nreliable conclusions, and the larger the better. 
For most of our data, we are only just at that\nminimum, so insights may not generalize well to a\nlarger population or timeframe.\n\n# **Process -**\n\nAnalysis will be carried out in the R programming language, in Posit Cloud,\ndue to its ease of use for statistical computing, data visualization, and\nsharing results with stakeholders.\n\n## Setting up my environment\n\nNotes: We’ll begin by installing and loading the necessary packages to\nrun our analysis:\n\ntidyverse, dplyr, here, janitor, skimr, lubridate, tibble, and waffle.\n\n``` r\ninstall.packages(\"tidyverse\")\ninstall.packages(\"dplyr\")\ninstall.packages(\"here\")\ninstall.packages(\"janitor\")\ninstall.packages(\"skimr\")\ninstall.packages(\"lubridate\")\ninstall.packages(\"tibble\")\ninstall.packages(\"waffle\")\n\nlibrary(tidyverse)\nlibrary(dplyr)\nlibrary(here)\nlibrary(janitor)\nlibrary(skimr)\nlibrary(lubridate)\nlibrary(tibble)\nlibrary(waffle)\n```\n\n## Importing our datasets\n\nNotes: We need to import our various datasets, which are stored as\n.csv files, into data frames that we can work with in R.\n\n``` r\ndaily_activity \u003c- read.csv(\"dailyActivity_merged.csv\", header = TRUE)\ndaily_calories \u003c- read.csv(\"dailyCalories_merged.csv\", header = TRUE)\ndaily_intensities \u003c- read.csv(\"dailyIntensities_merged.csv\", header = TRUE)\ndaily_steps \u003c- read.csv(\"dailySteps_merged.csv\", header = TRUE)\nsleep_day \u003c- read.csv(\"sleepDay_merged.csv\", header = TRUE)\nweight_log \u003c- read.csv(\"weightLogInfo_merged.csv\", header = TRUE)\nhourly_steps \u003c- read.csv(\"hourlySteps_merged.csv\", header = TRUE)\nhourly_intensity \u003c- read.csv(\"hourlyIntensities_merged.csv\", header = TRUE)\nhourly_calories \u003c- read.csv(\"hourlyCalories_merged.csv\", header = TRUE)\n```\n\n## Previewing datasets\n\nNotes: Previewing and examining the structure of our recently\nimported datasets for preliminary analysis and planning.\n\n``` 
r\nstr(daily_activity)\n```\n\n    ## 'data.frame':    940 obs. of  15 variables:\n    ##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...\n    ##  $ ActivityDate            : chr  \"4/12/2016\" \"4/13/2016\" \"4/14/2016\" \"4/15/2016\" ...\n    ##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...\n    ##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...\n    ##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...\n    ##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...\n    ##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...\n    ##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...\n    ##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...\n    ##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...\n    ##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...\n    ##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...\n    ##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...\n    ##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...\n    ##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...\n\n``` r\nstr(daily_calories)\n```\n\n    ## 'data.frame':    940 obs. of  3 variables:\n    ##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...\n    ##  $ ActivityDay: chr  \"4/12/2016\" \"4/13/2016\" \"4/14/2016\" \"4/15/2016\" ...\n    ##  $ Calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...\n\n``` r\nstr(daily_intensities)\n```\n\n    ## 'data.frame':    940 obs. 
of  10 variables:\n    ##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...\n    ##  $ ActivityDay             : chr  \"4/12/2016\" \"4/13/2016\" \"4/14/2016\" \"4/15/2016\" ...\n    ##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...\n    ##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...\n    ##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...\n    ##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...\n    ##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...\n    ##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...\n    ##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...\n    ##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...\n\n``` r\nstr(daily_steps)\n```\n\n    ## 'data.frame':    940 obs. of  3 variables:\n    ##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...\n    ##  $ ActivityDay: chr  \"4/12/2016\" \"4/13/2016\" \"4/14/2016\" \"4/15/2016\" ...\n    ##  $ StepTotal  : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...\n\n``` r\nstr(sleep_day)\n```\n\n    ## 'data.frame':    413 obs. of  5 variables:\n    ##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...\n    ##  $ SleepDay          : chr  \"4/12/2016 12:00:00 AM\" \"4/13/2016 12:00:00 AM\" \"4/15/2016 12:00:00 AM\" \"4/16/2016 12:00:00 AM\" ...\n    ##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...\n    ##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...\n    ##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...\n\n``` r\nstr(weight_log)\n```\n\n    ## 'data.frame':    67 obs. 
of  8 variables:\n    ##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...\n    ##  $ Date          : chr  \"5/2/2016 11:59:59 PM\" \"5/3/2016 11:59:59 PM\" \"4/13/2016 1:08:52 AM\" \"4/21/2016 11:59:59 PM\" ...\n    ##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...\n    ##  $ WeightPounds  : num  116 116 294 125 126 ...\n    ##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...\n    ##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...\n    ##  $ IsManualReport: chr  \"True\" \"True\" \"False\" \"True\" ...\n    ##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...\n\n``` r\nstr(hourly_steps)\n```\n\n    ## 'data.frame':    22099 obs. of  3 variables:\n    ##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...\n    ##  $ ActivityHour: chr  \"4/12/2016 12:00:00 AM\" \"4/12/2016 1:00:00 AM\" \"4/12/2016 2:00:00 AM\" \"4/12/2016 3:00:00 AM\" ...\n    ##  $ StepTotal   : int  373 160 151 0 0 0 0 0 250 1864 ...\n\n``` r\nstr(hourly_intensity)\n```\n\n    ## 'data.frame':    22099 obs. of  4 variables:\n    ##  $ Id              : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...\n    ##  $ ActivityHour    : chr  \"4/12/2016 12:00:00 AM\" \"4/12/2016 1:00:00 AM\" \"4/12/2016 2:00:00 AM\" \"4/12/2016 3:00:00 AM\" ...\n    ##  $ TotalIntensity  : int  20 8 7 0 0 0 0 0 13 30 ...\n    ##  $ AverageIntensity: num  0.333 0.133 0.117 0 0 ...\n\n``` r\nstr(hourly_calories)\n```\n\n    ## 'data.frame':    22099 obs. 
of  3 variables:\n    ##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...\n    ##  $ ActivityHour: chr  \"4/12/2016 12:00:00 AM\" \"4/12/2016 1:00:00 AM\" \"4/12/2016 2:00:00 AM\" \"4/12/2016 3:00:00 AM\" ...\n    ##  $ Calories    : int  81 61 59 47 48 48 48 47 68 141 ...\n\nUpon examining the data frames and column names, we see that\ndaily_activity already contains the complete data for daily_calories,\ndaily_intensities, and daily_steps.\n\nSo from now on, when using the data for daily_calories,\ndaily_intensities, and daily_steps, I will reference their respective\ncolumns from the daily_activity data frame for simplicity.\n\n## Verifying sample size\n\nNotes: Counting how many distinct participants we have for each dataset,\nand checking whether each is appropriate to use for our analysis.\n\n``` r\nn_distinct(daily_activity$Id)\n```\n\n    ## [1] 33\n\n``` r\nn_distinct(daily_calories$Id)\n```\n\n    ## [1] 33\n\n``` r\nn_distinct(daily_intensities$Id)\n```\n\n    ## [1] 33\n\n``` r\nn_distinct(daily_steps$Id)\n```\n\n    ## [1] 33\n\n``` r\nn_distinct(sleep_day$Id)\n```\n\n    ## [1] 24\n\n``` r\nn_distinct(weight_log$Id)\n```\n\n    ## [1] 8\n\n``` r\nn_distinct(hourly_steps$Id)\n```\n\n    ## [1] 33\n\n``` r\nn_distinct(hourly_intensity$Id)\n```\n\n    ## [1] 33\n\n``` r\nn_distinct(hourly_calories$Id)\n```\n\n    ## [1] 33\n\nIn statistics, there is a “rule of 30,” a rule of thumb under which 30 is typically treated as the\nminimum sample size needed to draw a statistically sound conclusion\nabout the data. Because of this, I will discard the weight data, which\nonly has 8 people. 
Sleep has 24 participants, but I will still run a\npreliminary analysis on it regardless, with that in consideration.\n\n## Converting into proper date format\n\nWe also see from the previous section that the dates are in character\nformat, so we will need to convert to proper date or date/time format,\nand hide the old columns.\n\nThis is so our packages will be able to properly run on the data in the\ncolumn.\n\n``` r\ndaily_activity \u003c- daily_activity %\u003e%\n  mutate(ActivityDate = mdy(ActivityDate))\nhourly_steps \u003c- hourly_steps %\u003e%\n  mutate(ActivityDateHour = mdy_hms(ActivityHour)) %\u003e%\n  select(-ActivityHour)\nhourly_intensity \u003c- hourly_intensity %\u003e%\n  mutate(ActivityDateHour = mdy_hms(ActivityHour)) %\u003e%\n  select(-ActivityHour)\nhourly_calories \u003c- hourly_calories %\u003e%\n  mutate(ActivityDateHour = mdy_hms(ActivityHour)) %\u003e%\n  select(-ActivityHour)\nsleep_day \u003c- sleep_day %\u003e%\n  mutate(SleepDate = as.Date(mdy_hms(SleepDay))) %\u003e% \n  select(-SleepDay)\n```\n\n## Verifying the values were properly changed from character to date \u0026 date/time.\n\n``` r\nclass(daily_activity$ActivityDate)\n```\n\n    ## [1] \"Date\"\n\n``` r\nclass(hourly_steps$ActivityDateHour)\n```\n\n    ## [1] \"POSIXct\" \"POSIXt\"\n\n``` r\nclass(hourly_intensity$ActivityDateHour)\n```\n\n    ## [1] \"POSIXct\" \"POSIXt\"\n\n``` r\nclass(hourly_calories$ActivityDateHour)\n```\n\n    ## [1] \"POSIXct\" \"POSIXt\"\n\n``` r\nclass(sleep_day$SleepDate)\n```\n\n    ## [1] \"Date\"\n\n## Cleaning the data\n\nNotes: Each data frame was manually imported and assigned the\nappropriate naming convention by me in snakecase.\n\nNext we will check for errors and inconsistencies in the data by\nchecking for duplicated records and missing values.\n\n``` r\nsum(duplicated(daily_activity))\n```\n\n    ## [1] 0\n\n``` r\nsum(duplicated(sleep_day))\n```\n\n    ## [1] 3\n\n``` r\nsum(duplicated(hourly_steps))\n```\n\n    ## [1] 0\n\n``` 
r\nsum(duplicated(hourly_calories))\n```\n\n    ## [1] 0\n\n``` r\nsum(duplicated(hourly_intensity))\n```\n\n    ## [1] 0\n\n``` r\nsum(is.na(daily_activity))\n```\n\n    ## [1] 0\n\n``` r\nsum(is.na(sleep_day))\n```\n\n    ## [1] 0\n\n``` r\nsum(is.na(hourly_steps))\n```\n\n    ## [1] 0\n\n``` r\nsum(is.na(hourly_calories))\n```\n\n    ## [1] 0\n\n``` r\nsum(is.na(hourly_intensity))\n```\n\n    ## [1] 0\n\nsleep_day has 3 duplicated records. All other datasets passed the\ninitial check.\n\n## Get rid of duplicated observations, and drop missing values from data frames as needed.\n\n``` r\ndaily_activity \u003c- daily_activity %\u003e%\n  distinct %\u003e%\n  drop_na\nsleep_day \u003c-sleep_day %\u003e%\n  distinct %\u003e%\n  drop_na\nhourly_steps \u003c- hourly_steps %\u003e%\n  distinct %\u003e%\n  drop_na\nhourly_calories \u003c- hourly_calories %\u003e%\n  distinct %\u003e%\n  drop_na\nhourly_intensity \u003c- hourly_intensity %\u003e%\n  distinct %\u003e%\n  drop_na\n```\n\n## Merge to create a new combined data frame, and verify distinct users.\n\n``` r\ndaily_activity_sleep \u003c- merge(daily_activity, sleep_day, by.x = c(\"Id\", \"ActivityDate\"), by.y = c(\"Id\", \"SleepDate\"), all = TRUE)\n\nn_distinct(daily_activity_sleep$Id)\n```\n\n    ## [1] 33\n\nHere we are creating a newly merged data frame for one of our analyses\nthat we are going to run.\n\n\n# **Analyze \u0026 Share -**\n\n\n## Basic overview summary:\n\nNotes: Some basic preliminary summary analysis\n\n``` r\ndaily_activity %\u003e%\n  select(TotalSteps, Calories, SedentaryMinutes) %\u003e%\n  summary()\n```\n\n    ##    TotalSteps       Calories    SedentaryMinutes\n    ##  Min.   :    0   Min.   :   0   Min.   :   0.0  \n    ##  1st Qu.: 3790   1st Qu.:1828   1st Qu.: 729.8  \n    ##  Median : 7406   Median :2134   Median :1057.5  \n    ##  Mean   : 7638   Mean   :2304   Mean   : 991.2  \n    ##  3rd Qu.:10727   3rd Qu.:2793   3rd Qu.:1229.5  \n    ##  Max.   :36019   Max.   :4900   Max.  
 :1440.0\n\n``` r\nsleep_day %\u003e%\n  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %\u003e%\n  summary()\n```\n\n    ##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed \n    ##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  \n    ##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  \n    ##  Median :1.00      Median :432.5      Median :463.0  \n    ##  Mean   :1.12      Mean   :419.2      Mean   :458.5  \n    ##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  \n    ##  Max.   :3.00      Max.   :796.0      Max.   :961.0\n\n``` r\nhourly_steps %\u003e%\n  select(StepTotal) %\u003e%\n  filter(StepTotal != 0) %\u003e%\n  summary()\n```\n\n    ##    StepTotal      \n    ##  Min.   :    1.0  \n    ##  1st Qu.:  103.0  \n    ##  Median :  287.0  \n    ##  Mean   :  552.7  \n    ##  3rd Qu.:  643.0  \n    ##  Max.   :10554.0\n\n``` r\nhourly_calories %\u003e%\n  select(Calories) %\u003e%\n  summary()\n```\n\n    ##     Calories     \n    ##  Min.   : 42.00  \n    ##  1st Qu.: 63.00  \n    ##  Median : 83.00  \n    ##  Mean   : 97.39  \n    ##  3rd Qu.:108.00  \n    ##  Max.   :948.00\n\n``` r\nhourly_intensity %\u003e%\n  select(TotalIntensity, AverageIntensity) %\u003e%\n  summary()\n```\n\n    ##  TotalIntensity   AverageIntensity\n    ##  Min.   :  0.00   Min.   :0.0000  \n    ##  1st Qu.:  0.00   1st Qu.:0.0000  \n    ##  Median :  3.00   Median :0.0500  \n    ##  Mean   : 12.04   Mean   :0.2006  \n    ##  3rd Qu.: 16.00   3rd Qu.:0.2667  \n    ##  Max.   :180.00   Max.   :3.0000\n\n-   The Centers for Disease Control and Prevention (CDC) recommends that the average\n    adult take around 8,000-10,000+ steps per day. Our data\n    show mean and median daily steps of 7,638 and 7,406,\n    respectively, so the adults in this sample are less active than recommended. 
One\n    preliminary recommendation is a device feature reminding people to get\n    up and be active/walk around once in a while, after a certain number\n    of sedentary minutes has elapsed.\n-   The National Institutes of Health (NIH) recommends that adults get\n    between 7 and 9 hours of sleep per night. The mean sleep time for\n    our group is 419.2 minutes, just under 7 hours (median 432.5\n    minutes), so this sample sits right at the lower bound of the\n    recommended range. We can use this as an opportunity to add a feature\n    to our device that reminds users to go to bed at a reasonable time,\n    given what time they wake up. We can also cite these\n    recommendations from government institutions as a credible source to\n    help persuade people to comply.\n-   The average number of hourly steps, counting only hours with at\n    least one recorded step (a rough proxy for waking hours), is\n    552.7 steps, with a median of 287.0.\n    This indicates the data is skewed to the right.\n-   The average number of hourly calories burned for our study\n    participants is ~97 calories.\n-   The median hourly total intensity for the Fitbit users is 3.00, with\n    a mean of 12.04, also indicating the data is skewed right.\n\n## Let days where users used at least one functionality of the device count as an “active day.”\n\nOut of a possible 31 days.\n\nLet us classify usage days as:\n\n-   0-10 Low usage\n\n-   11-20 Medium usage\n\n-   21-31 High usage\n\n``` r\nusage_counts \u003c- daily_activity_sleep %\u003e%\n  group_by(Id) %\u003e%\n  summarise(days_active = n_distinct(ActivityDate)) %\u003e%\n  mutate(device_usage = case_when(\n    days_active \u003e= 0 \u0026 days_active \u003c= 10 ~ \"Low Usage\",\n    days_active \u003e= 11 \u0026 days_active \u003c= 20 ~ \"Medium Usage\",\n    days_active \u003e= 21 \u0026 days_active \u003c= 31 ~ \"High Usage\"\n  )) %\u003e%\n  group_by(device_usage) %\u003e%\n  count(device_usage)\n  \nprint(usage_counts)\n```\n\n    ## # A tibble: 3 × 2\n 
   ## # Groups:   device_usage [3]\n    ##   device_usage     n\n    ##   \u003cchr\u003e        \u003cint\u003e\n    ## 1 High Usage      29\n    ## 2 Low Usage        1\n    ## 3 Medium Usage     3\n\n``` r\nwaffle_data \u003c- setNames(usage_counts$n, usage_counts$device_usage)\nwaffle(waffle_data,\n       rows = 3,\n       xlab = \"1 Square = 1 User\",\n       title = \"Device Usage Frequency by Total Days\",\n       colors = c(\"green\", \"red\", \"yellow\"),\n       size = 1,\n       legend_pos = \"right\")\n```\n\n![](https://github.com/AnthonysAnalysis/Data-Analytics-Portfolio/blob/main/Images/1.png)\n\n-   The majority of our sample participants used their\n    device on most days of the study period: 29 of 33 users\n    fit our criteria for “high usage” (21-31 days), which is at least\n    two-thirds of the total time frame of 31 days.\n-   We can market our more premium subscription tiers to the high usage\n    class, who are clearly interested in their health metrics. We can also offer\n    different price plans and types to capture the different ways\n    people prefer to pay, including those who are hesitant to commit. 
Possibly one\n    that offers more metrics, recommendations, and reminders based on\n    different tiers.\n-   Can we determine how we can get the low/medium usage groups to\n    engage more often?\n-   What are the reasons behind the low usage group and is it worth\n    looking into?\n\n## Visualizations\n\nWhat else can we interpret from creating some visuals?\n\n``` r\ncalories_distance_model \u003c- lm(TotalDistance ~ Calories, data = na.omit(daily_activity))\ncalories_distance_r_squared \u003c- summary(calories_distance_model)$r.squared\nggplot(data = na.omit(daily_activity), aes(x = Calories, y = TotalDistance)) + \n  geom_point(color = \"blue\") + \n  geom_smooth(method = \"lm\", color = \"yellow\") +\n  labs(title = \"Longer Distances Walked Leads to More Calories Burned\",\n       x = \"Daily Calories Burned\",\n       y = \"Total Daily Distance Walked\") +\n  theme_minimal() +\n  annotate(\"text\", x = max(na.omit(daily_activity)$Calories) * 0.7, \n           y = max(na.omit(daily_activity)$TotalDistance) * 0.9, \n           label = paste(\"R² = \", round(calories_distance_r_squared, 3)), size = 5, color = \"black\")\n```\n\n    ## `geom_smooth()` using formula = 'y ~ x'\n\n![](https://github.com/AnthonysAnalysis/Data-Analytics-Portfolio/raw/main/Images/2.png)\n\n-   The general trend here is that there is a positive correlation\n    between total daily distance walked and daily calories burned. 
We\n    can include this information in a pop-up reminder for the user to\n    be aware of, if that’s their goal.\n\n``` r\nsleep_day_of_week \u003c- sleep_day %\u003e% \n  mutate(day_of_week = wday(SleepDate, label = TRUE, abbr = FALSE))\naverage_sleep_per_day \u003c- sleep_day_of_week %\u003e%\n  group_by(day_of_week) %\u003e%\n  summarise(average_sleep_time = mean(TotalMinutesAsleep, na.rm = TRUE)) %\u003e%\n  arrange(match(day_of_week, c(\"Monday\", \"Tuesday\", \"Wednesday\", \"Thursday\", \"Friday\", \"Saturday\", \"Sunday\")))\noverall_mean_sleep \u003c- mean(average_sleep_per_day$average_sleep_time)\nggplot(average_sleep_per_day, aes(x = day_of_week, y = average_sleep_time, fill = day_of_week)) +\n  geom_col() + \n  geom_hline(yintercept = overall_mean_sleep, linetype = \"dashed\", color = \"red\") +\n  labs( title = \"Why Do People Sleep More on Wednesdays?\", x = \"Day of the Week\", y = \"Average Time Asleep (minutes)\") + \n  theme_minimal() +\n  theme(axis.text.x = element_text(angle = 45, hjust = 1))\n```\n\n![](https://github.com/AnthonysAnalysis/Data-Analytics-Portfolio/raw/main/Images/3.png)\n\n-   It looks like people sleep slightly more on the weekend on average.\n    This is expected as people are recovering from the workweek and tend\n    to sleep in.\n\n-   A possible point of future analysis is why people also appear to\n    sleep more on Wednesdays on average.\n\n``` r\nhourly_calories_by_hour \u003c- hourly_calories %\u003e% \n  mutate(hour_of_day = hour(ActivityDateHour))\naverage_hourly_calories \u003c- hourly_calories_by_hour %\u003e% \n  group_by(hour_of_day) %\u003e% \n  summarise(average_calories = mean(Calories, na.rm = TRUE)) %\u003e% \n  arrange(hour_of_day)\nggplot(average_hourly_calories, aes(x = hour_of_day, y = average_calories, fill = factor(hour_of_day))) + \n  geom_bar(stat = \"identity\", show.legend = FALSE) +\n  labs(title = \"Most Hourly Calories Are Burned During Active Hours\",\n       x = \"Hour of Day\",\n    
   y = \"Average Calories Burned\" ) +\n  theme_minimal() +\n  scale_x_continuous(breaks = 0:23) +\n  theme(axis.text.x = element_text(angle = 45, hjust = 1))\n```\n\n![](https://github.com/AnthonysAnalysis/Data-Analytics-Portfolio/raw/main/Images/4.png)\n\n-   The most active hours of the day are during the waking hours.\n-   Most people wake up around 6-7am for the day, with the calories\n    burned gradually ramping up hourly until they get off of work at\n    around 4-5pm. Then presumably they would go take care of other\n    miscellaneous things for a little while, before hourly calories\n    burned comes back down towards the end of the day.\n\n``` r\nhourly_intensity_by_hour \u003c- hourly_intensity %\u003e%\n  mutate( hour_of_day = hour(ActivityDateHour),\n          day_type = if_else(wday(ActivityDateHour, label = TRUE) %in% c(\"Sat\", \"Sun\"), \"Weekend\", \"Weekday\"))\naverage_hourly_intensity_by_weekend_weekday \u003c- hourly_intensity_by_hour %\u003e%\n  group_by(day_type, hour_of_day) %\u003e% summarise(mean_intensity_hour = mean(TotalIntensity, na.rm = TRUE)) %\u003e% \n  arrange(factor(day_type, levels = c(\"Weekday\", \"Weekend\")), hour_of_day)\n```\n\n    ## `summarise()` has grouped output by 'day_type'. 
You can override using the\n    ## `.groups` argument.\n\n``` r\nggplot(average_hourly_intensity_by_weekend_weekday, aes(x = hour_of_day, y = mean_intensity_hour, fill = day_type)) +\n  geom_bar(stat = \"identity\", position = \"dodge\") +\n  labs( title = \"People's Routines Differ on Weekdays vs Weekends\", \n        x = \"Hour of Day\",\n        y = \"Mean Intensity Value\" ) +\n  scale_x_continuous(breaks = 0:23) +\n  scale_fill_manual(values = c(\"Weekday\" = \"blue\", \"Weekend\" = \"pink\")) +\n  theme_minimal() +\n  theme(axis.text.x = element_text(angle = 45, hjust = 1))\n```\n\n![](https://github.com/AnthonysAnalysis/Data-Analytics-Portfolio/raw/main/Images/5.png)\n\n-   People have different schedules/routines on the weekend vs weekday.\n-   We can take this into consideration when designing our app features.\n\n``` r\ndistance_calories_model \u003c- lm(Calories ~ TotalDistance, data = daily_activity)\ndistance_calories_r_squared \u003c- summary(distance_calories_model)$r.squared\nggplot(daily_activity, aes(x = TotalDistance, y = Calories)) +\n  geom_point(alpha = 0.6, color = \"blue\") +  \n  geom_smooth(method = \"lm\", color = \"red\", linewidth = 1) +  \n  labs(title = \"A Positive Correlation Between Daily Distance and Calories\",\n       x = \"Total Distance\",\n       y = \"Calories\") +\n  annotate(\"text\", x = max(daily_activity$TotalDistance) * 0.7, \n           y = max(daily_activity$Calories) * 0.9, \n           label = paste(\"R-squared = \", round(distance_calories_r_squared, 3)), \n           color = \"black\", size = 5, fontface = \"bold\") +  \n  theme_minimal() +\n  theme(axis.text.x = element_text(angle = 45, hjust = 1))\n```\n\n    ## `geom_smooth()` using formula = 'y ~ x'\n\n![](https://github.com/AnthonysAnalysis/Data-Analytics-Portfolio/raw/main/Images/6.png)\n\n``` r\nsteps_calories_model \u003c- lm(Calories ~ TotalSteps, data = daily_activity)\nsteps_calories_r_squared \u003c- 
summary(steps_calories_model)$r.squared\nggplot(daily_activity, aes(x = TotalSteps, y = Calories)) +\n  geom_point(alpha = 0.6, color = \"blue\") +  \n  geom_smooth(method = \"lm\", color = \"red\", linewidth = 1) +  \n  labs(title = \"Positive Correlation of Steps vs Calories\",\n       x = \"Total Steps\",\n       y = \"Calories\") +\n  annotate(\"text\", x = max(daily_activity$TotalSteps) * 0.7, \n           y = max(daily_activity$Calories) * 0.9, \n           label = paste(\"R-squared = \", round(steps_calories_r_squared, 3)), \n           color = \"black\", size = 5, fontface = \"bold\") +  \n  theme_minimal() +\n  theme(axis.text.x = element_text(angle = 45, hjust = 1))\n```\n\n    ## `geom_smooth()` using formula = 'y ~ x'\n\n![](https://github.com/AnthonysAnalysis/Data-Analytics-Portfolio/raw/main/Images/7.png)\n\n-   Both of these metrics (total distance and total steps) correlate\n    with calories burned to a similar degree.\n\n``` r\nsteps_distance_model \u003c- lm(TotalDistance ~ TotalSteps, data = daily_activity)\nsteps_distance_r_squared \u003c- summary(steps_distance_model)$r.squared\nggplot(daily_activity, aes(x = TotalSteps, y = TotalDistance)) +\n  geom_point(alpha = 0.6, color = \"blue\") +  \n  geom_smooth(method = \"lm\", color = \"red\", linewidth = 1) +  \n  labs(title = \"Closely Correlated As Expected\",\n       x = \"Total Daily Steps\",\n       y = \"Total Daily Distance\") +\n  annotate(\"text\", x = max(daily_activity$TotalSteps) * 0.7, \n           y = max(daily_activity$TotalDistance) * 0.9, \n           label = paste(\"R-squared = \", round(steps_distance_r_squared, 3)), \n           color = \"black\", size = 5, fontface = \"bold\") +  \n  theme_minimal() +\n  theme(axis.text.x = element_text(angle = 45, hjust = 1))  \n```\n\n    ## `geom_smooth()` using formula = 'y ~ x'\n\n![](https://github.com/AnthonysAnalysis/Data-Analytics-Portfolio/raw/main/Images/8.png)\n\n-   Both daily steps and total distance are good predictors of daily\n    calories burned.\n\n``` 
r\naverage_hourly_steps_intensity_merged \u003c- hourly_steps %\u003e%\n  merge(hourly_intensity, by = \"ActivityDateHour\") %\u003e%\n  group_by(hour = hour(ActivityDateHour)) %\u003e%\n  summarise(\n    average_hourly_steps = mean(StepTotal, na.rm = TRUE), \n    average_hourly_intensity = mean(TotalIntensity, na.rm = TRUE)\n  ) %\u003e%\n  arrange(hour)\n\n\nggplot(average_hourly_steps_intensity_merged, aes(x = hour)) +\n  geom_line(aes(y = average_hourly_steps, color = \"Average Hourly Steps\"), linewidth = 1) +\n  geom_line(aes(y = average_hourly_intensity * 100, color = \"Average Hourly Intensity\"), linewidth = 1) +\n  scale_y_continuous(\n    name = \"Average Hourly Steps\",\n    sec.axis = sec_axis(~ . / 100, name = \"Average Hourly Intensity (Scaled x100)\")\n  ) +\n  labs(\n    title = \"People Are Most Active Mid-day\", \n    x = \"Hour of Day\"\n  ) +\n  theme_minimal() +\n  theme(\n    axis.text.x = element_text(angle = 45, hjust = 1), \n    axis.title.y.right = element_text(color = \"blue\"),\n    axis.text.y.right = element_text(color = \"blue\"), \n    axis.title.y.left = element_text(color = \"green\"),\n    axis.text.y.left = element_text(color = \"green\")\n  ) +\n  scale_color_manual(values = c(\"Average Hourly Steps\" = \"green\", \"Average Hourly Intensity\" = \"blue\")) +\n  theme(legend.title = element_blank())\n```\n\n![](https://github.com/AnthonysAnalysis/Data-Analytics-Portfolio/raw/main/Images/9.png)\n\n-   Both hourly steps and intensity follow a similar trend of ramping up\n    mid-day.\n\n``` r\npercent_time_in_bed_sleeping \u003c- sleep_day %\u003e%\n  summarise( sum_total_minutes_asleep = sum(TotalMinutesAsleep, na.rm = TRUE),\n             sum_total_time_in_bed = sum(TotalTimeInBed, na.rm = TRUE),\n             percentage_asleep = (sum_total_minutes_asleep / sum_total_time_in_bed) * 100,\n             percentage_awake = 100 - percentage_asleep)\n\npercent_sleep_pie \u003c- data.frame(\n  category = c(\"Time In Bed Asleep\", \"Time 
In Bed Awake\"), \n  percentage = c(percent_time_in_bed_sleeping$percentage_asleep,\n                 percent_time_in_bed_sleeping$percentage_awake) )\n\nggplot(percent_sleep_pie, aes(x = \"\", y = percentage, fill = category)) +\n  geom_bar(stat = \"identity\", width = 1) +\n  coord_polar(theta = \"y\") +\n  labs(title = \"Are Users Wasting Time Lounging Around In Bed?\") + \n  theme_void() +  # Remove background grid\n  scale_fill_manual(values = c(\"skyblue\", \"salmon\")) +  \n  geom_text(aes(label = paste0(round(percentage, 1), \"%\")),  \n            position = position_stack(vjust = 0.5),  \n            color = \"black\", size = 6)\n```\n\n![](https://github.com/AnthonysAnalysis/Data-Analytics-Portfolio/raw/main/Images/10.png)\n\n-   This chart shows the percentage of total time in bed that users\n    spent asleep, aggregated across all users.\n\n-   Overall, users do not appear to be getting distracted or having a\n    hard time falling asleep.\n\n-   If they did spend more time awake in bed, we could implement an\n    optional reminder feature prompting users to stop all activity and\n    focus on sleeping so they can hit their scheduled sleep goal.\n\n\n# **Act -**\n\n\n## Recommendations \u0026 final thoughts:\n\nMarketing and branding:\n\n-   Work on a marketing campaign that reinvents our brand as not just a\n    product but a lifestyle: if you use our product, you’re part of an\n    exclusive group with a company that knows and understands you, a\n    brand that empowers a group whose concerns have been historically\n    overlooked. 
We offer women-focused recommendations\n    that other companies, such as Fitbit, cannot.\n\nExpand product offerings:\n\n-   From analyzing Fitbit smart device usage frequency, we saw that\n    29/33 users were categorized as “high frequency” users, meaning they\n    used their device on more than ~2/3 of the days in the study\n    period. They are clearly interested in the smart device tech field,\n    so we can market to and offer them other similar products as well.\n    We would conduct market research to figure out what this demographic\n    would also be interested in and go from there.\n\n-   If we offer more products, a strong selling point for purchasing\n    more of them would be seamless integration across products from\n    our brand ecosystem, as well as possible cost savings under one\n    subscription membership. This would incentivize customers to stay\n    loyal to our brand, rather than exploring our competitors.\n\nExpand current product features:\n\n-   Research other features that customers in this segment would be\n    interested in, and determine whether to offer each as a new product\n    or add it to an existing product.\n\n-   Offer a connectedness feature. Give users the opportunity to share\n    to social media, or with other “Bellabeat friends”, when they\n    achieve a certain goal or milestone.\n\n-   Add an “educational feature” that gives healthy lifestyle\n    recommendations from credible sources such as the Centers for\n    Disease Control and Prevention (CDC) or the National Institutes of\n    Health (NIH). 
We saw from our analysis that a good portion of adults in\n    our study barely met the lower end of the recommended daily steps\n    and sleep hours.\n\n-   For recommendations such as the above, implement a reminder feature\n    that allows users to set a goal, with appropriate reminders to help\n    users reach that goal if it appears they are not on track.\n\nDiversify distribution channels:\n\n-   Bellabeat currently sells its products only through online\n    retailers and its own website. An easy way to increase our reach\n    and market share is to extend our distribution channels. We could\n    begin offering our products in physical retail stores as well.\n    As a women-centric company, Bellabeat could further capitalize on\n    this by partnering with women-focused retail stores to reach its\n    target demographic more easily.\n\nFuture analysis:\n\n-   If given the opportunity, I’d like to run a future analysis on\n    Bellabeat’s data itself, instead of relying on Fitbit’s data and\n    extrapolating to our company, as the differing demographics may\n    weaken the validity of this analysis. By analyzing Bellabeat’s\n    data, we would also get insight into metrics that only our\n    women-centric products capture and the Fitbit does not, such as the\n    reproductive cycle.\n\n-   Trends for various metrics may also be different for women versus a\n    mixed, unknown demographic such as that of the Fitbit data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanthonysanalysis%2Fbellabeat-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanthonysanalysis%2Fbellabeat-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanthonysanalysis%2Fbellabeat-analysis/lists"}