{"id":24696782,"url":"https://github.com/yinkaadu/cvdproject_files","last_synced_at":"2026-04-12T11:38:28.584Z","repository":{"id":273157902,"uuid":"918843103","full_name":"YinkaAdu/cvdproject_files","owner":"YinkaAdu","description":"Raw data, SQL database and code; Python code and visualisations for Heart Disease Project","archived":false,"fork":false,"pushed_at":"2025-01-23T12:11:46.000Z","size":21157,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-03T02:06:56.508Z","etag":null,"topics":["3d-scatter-plot","data-calculations","excel-import","healthcare","multiple-subplots","mysql-database","powerbi","visualization"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/YinkaAdu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-19T02:07:10.000Z","updated_at":"2025-01-23T12:11:50.000Z","dependencies_parsed_at":"2025-01-22T06:24:48.097Z","dependency_job_id":null,"html_url":"https://github.com/YinkaAdu/cvdproject_files","commit_stats":null,"previous_names":["yinkaadu/cardio_sql_files","yinkaadu/cvdproject_files"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YinkaAdu%2Fcvdproject_files","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YinkaAdu%2Fcvdproject_files/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YinkaAdu%2Fcvdproject_files/release
s","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YinkaAdu%2Fcvdproject_files/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/YinkaAdu","download_url":"https://codeload.github.com/YinkaAdu/cvdproject_files/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244901184,"owners_count":20528879,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["3d-scatter-plot","data-calculations","excel-import","healthcare","multiple-subplots","mysql-database","powerbi","visualization"],"created_at":"2025-01-27T02:04:31.862Z","updated_at":"2026-04-12T11:38:28.549Z","avatar_url":"https://github.com/YinkaAdu.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Health Monitoring Data Analysis to Inform Personalized CVD Care\n \n## Background\nCardiovascular Disease (CVD), more commonly known as heart disease, remains a major global health challenge. At least 20 million people die of CVD each year (WHO, 2021). While many of these deaths are preventable, progress in reducing them has slowed in recent years. \n \n## Project Goal\nThis study aims to better understand the risk factors for heart disease, specifically to identify which factors, individually and together, increase the risk of heart disease. This information could potentially inform the use of healthcare data to increase the precision and accuracy of cardiovascular disease prediction. 
The author analysed data from about 70,000 participants, studying the relationship between factors such as age, gender, blood sugar, cholesterol, weight, and lifestyle habits.\n\n## Statistical Analysis\nStatistical analysis was carried out in R to answer the following questions:\n1. Do CVD patients and people without CVD differ in their lifestyle habits (smoking, drinking, activity), glucose levels, and cholesterol levels?\n2. Do CVD patients differ in Body Mass Index (BMI) from people without CVD?\n\n**Question 1** was tackled using Pearson's Chi-squared test with Yates' continuity correction. Although a significantly higher proportion of the population are non-smokers, non-drinkers and physically active, the results suggested that smoking, physical activity levels, glucose levels, and cholesterol levels were significantly associated with the presence of cardiovascular conditions in the dataset.\nAlcohol consumption, on the other hand, may have a weaker or no significant association with cardiovascular conditions based on this analysis.\n\n**Question 2** was tackled using a Welch two-sample t-test. The analysis provided strong evidence that individuals with cardiovascular conditions have a significantly higher mean BMI than those without. \n\n![cvd_stat_visual](https://github.com/user-attachments/assets/88ce4465-1e02-4ad5-849b-97d16b8ddb1e)\n\n## Sample Visuals\n\n### Cardiovascular Disease (CVD) Incidence across different Body Mass Index (BMI) Categories\nThe first image provides a summary of the data as well as some of the parameters considered in this study. The BMI categories were assigned according to the WHO crude estimate for BMI classification. The participant population was first grouped by the presence of CVD and then by gender. \nThe second image further explores the lifestyle habits, glucose levels and cholesterol levels of the female population. \nThe third image shows the same parameters for the male population. 
\n\nhttps://github.com/user-attachments/assets/2ec30ed8-6452-4bf3-a424-46dfe9895ee1\n\n### Lifestyle Habits vs CVD Incidence\nThe final image is an interactive visual showing the 8 permutations of lifestyle habits as subplots, which can be filtered/sliced to select groups of interest and identify which factors, individually and together, increase the risk of heart disease. \n\nhttps://github.com/user-attachments/assets/3a753ed7-e602-471a-aa08-b83b9d4cb7a0\n\nGlucose levels and cholesterol levels were combined into 9 categories to create a slicer. The 'lifestyle habits' were also grouped into 8 categories, which make up the subplots shown (further detail in the Multiple_CVD_subplots.ipynb notebook).\n\n## Software Tools\nMySQL, Jupyter Notebook, Power BI Desktop, MS PowerPoint and MS Excel\n\n### Snippets of SQL Code for Data Preparation\n~~~~sql\nSELECT *\nFROM cardiovascular_incidence\nORDER BY 2;\n\n# Choose more descriptive column names\nALTER TABLE cardiovascular_incidence\n\tCHANGE COLUMN height height_in_cm\n\t\tINT NULL;\n\t\t\t\nALTER TABLE cardiovascular_incidence\n\tCHANGE COLUMN weight weight_in_kg\n\t\tINT NULL;\n\n# Calculate the BMI (add new column body_mass_index)\n# Rank the BMI - 1 : underweight, 2 : healthy, 3 : overweight, 4 : obese, 5 : severely obese (add new column bmi_ranking)\n\t\nALTER TABLE cardio_train_incidence\n\tADD body_mass_index\n\t\tFLOAT DEFAULT 0\n\t\t\tAFTER weight_in_kg;\nUPDATE cardio_train_incidence\nSET\n\tbody_mass_index = IFNULL(weight_in_kg /((height_in_cm/100) * (height_in_cm/100)), 0);\n\n# Having noticed abnormally high body_mass_index, I identified wrong entries in height_in_cm\nUPDATE cardio_train_incidence\nSET\n\theight_in_cm = height_in_cm + 100\nWHERE height_in_cm \u003c 100;\n\n# Create a new column to compute different permutations of lifestyle habits (smoker, alcohol and active)\nALTER TABLE cardio_train_incidence\n\tADD lifestyle_habits\n\t\tFLOAT DEFAULT 0\n\t\t\tAFTER active;\nUPDATE 
cardio_train_incidence\nSET\n\tlifestyle_habits = 1\nWHERE \n\t(smoker is False AND alcohol is False AND active is False);\n# And so on for the following permutations\n# 2 : where smoker is False and alcohol is False and Active is True\n# 3 : where smoker is False and alcohol is True and Active is False\n# 4 : where smoker is False and alcohol is True and Active is True\n# 5 : where smoker is True and alcohol is False and Active is False\n# 6 : where smoker is True and alcohol is False and Active is True\n# 7 : where smoker is True and alcohol is True and Active is False\n# 8 : where smoker is True and alcohol is True and Active is True\n~~~~\n\n### Snippets of Python Code for Data Visualisation \n\n```python\n# Bring in packages\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport altair as alt\n\n# Read Queried Data\ndf = pd.read_csv(r'C:\\Users\\User\\OneDrive\\Documents\\Cardiohealth\\bmi_pop_data.csv') \nprint(df.head())\n\n# Boolean values are still stored as TINYINT (0/1); convert to True/False\ndf['cardio_condition'] = df['cardio_condition'] == 1\n\nprint(len(df))\n\n# Count the frequency of each BMI classification\nbmigendergroup_counts = df.groupby(['bmi_class', 'cardio_condition', 'gender']).size().unstack(['cardio_condition', 'gender'])\nbmigroup_counts = df.groupby(['bmi_class', 'cardio_condition']).size().unstack(fill_value=0) \n# value_counts() returns a Series, so name it rather than assigning .columns\nbmicategory_counts = df['bmi_class'].value_counts().rename('frequency')\n\n# Create the bar chart of frequency against BMI category\nplt.figure(figsize=(8, 6))\nbmicategory_counts.plot(kind='bar', color='skyblue')\nplt.title('Distribution of BMI Classes')\nplt.xlabel('BMI Category')\nplt.ylabel('Number of Participants')\nplt.tight_layout()\n\n# Save the plot as a PNG image (before plt.show(), otherwise an empty figure may be saved)\nplt.savefig('BMIcategory_distribution_bar_chart.png')\n\n# Display the plot\nplt.show()\n```\nPython was also used to add the multiple subplots to Power 
BI. Details can be found in the repository associated with this project, which includes notebooks containing code and visual output.\n\n### Snippets of R code for Analysis\n```{r}\n# Find the path of the working directory\ngetwd()\n\n# Load necessary libraries\nlibrary(dplyr)\nlibrary(ggplot2)\nlibrary(stats)\n\n# Create a data frame from cvd_dataset\ndf \u003c- data.frame(cvd_dataset)\n\n# 1. Do CVD patients and people without CVD differ in their lifestyle habits (smoking, drinking, activity), glucose levels, and cholesterol levels?\n\n#   a. Smoking\ntable(df$cardio_condition, df$smoker) \nchisq.test(table(df$cardio_condition, df$smoker)) \n\n#   b. Alcohol Consumption\ntable(df$cardio_condition, df$alcohol) \nchisq.test(table(df$cardio_condition, df$alcohol)) \n\n#   c. Activity\ntable(df$cardio_condition, df$active) \nchisq.test(table(df$cardio_condition, df$active)) \n\n#   d. Glucose Levels\ntable(df$cardio_condition, df$glucose_levels) \nchisq.test(table(df$cardio_condition, df$glucose_levels)) \n\n#   e. Cholesterol Levels\ntable(df$cardio_condition, df$cholesterol_levels) \nchisq.test(table(df$cardio_condition, df$cholesterol_levels))\n```\n\n## Challenges Encountered\n1. Most of the data used was categorical: smoking, activity and alcohol consumption were all recorded as boolean values, even though it is unclear, for instance, what the threshold for an active lifestyle is. This limited the depth of the statistical analysis.\n2. There was no detail about CVD types: CVDs comprise a broad range of diseases that are often influenced in different ways by the factors studied. Knowing which types of CVD participants were living with would most likely have reduced the ambiguity of the results.\n\n## Contributors and Collaborators\nContributions and comments are welcome on this project. 
\nThe author is also willing to collaborate on data analytics projects.\n\n## License\n[MIT](https://choosealicense.com/licenses/mit/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyinkaadu%2Fcvdproject_files","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyinkaadu%2Fcvdproject_files","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyinkaadu%2Fcvdproject_files/lists"}