{"id":28253990,"url":"https://github.com/melinteflxrin/softserve-bigdata-project","last_synced_at":"2026-01-26T17:31:22.366Z","repository":{"id":292792171,"uuid":"981971459","full_name":"melinteflxrin/SoftServe-BigData-Project","owner":"melinteflxrin","description":"End-to-end data warehousing project integrating APIs, ETL workflows, and PostgreSQL for analytics and reporting.","archived":false,"fork":false,"pushed_at":"2025-07-18T14:36:23.000Z","size":1696,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-18T19:04:00.029Z","etag":null,"topics":["analytics","api","bigdata","data","datawarehousing","externalapi","pipeline","postgres","postgresql","python","warehouse"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/melinteflxrin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-12T07:22:45.000Z","updated_at":"2025-07-18T14:36:27.000Z","dependencies_parsed_at":"2025-07-18T16:24:08.880Z","dependency_job_id":"18d06046-9a7a-4074-a4b0-fe448266e8d7","html_url":"https://github.com/melinteflxrin/SoftServe-BigData-Project","commit_stats":null,"previous_names":["melinteflxrin/softserve-bigdata-project"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/melinteflxrin/SoftServe-BigData-Project","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/melinteflxrin%2FSoftServe-BigData-Project","tags_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/melinteflxrin%2FSoftServe-BigData-Project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/melinteflxrin%2FSoftServe-BigData-Project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/melinteflxrin%2FSoftServe-BigData-Project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/melinteflxrin","download_url":"https://codeload.github.com/melinteflxrin/SoftServe-BigData-Project/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/melinteflxrin%2FSoftServe-BigData-Project/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28782930,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-26T13:55:28.044Z","status":"ssl_error","status_checked_at":"2026-01-26T13:55:26.068Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","api","bigdata","data","datawarehousing","externalapi","pipeline","postgres","postgresql","python","warehouse"],"created_at":"2025-05-19T18:19:08.700Z","updated_at":"2026-01-26T17:31:22.353Z","avatar_url":"https://github.com/melinteflxrin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Warehousing Project:\u003cbr/\u003e Health, Fitness \u0026 Nutrition Analytics\n\nThis project is a comprehensive data 
warehousing solution designed for a hypothetical health-tech startup.\u003cbr\u003e\nIt demonstrates best practices in ETL, data governance, privacy, and analytics with interactive dashboards and role-based access control.\n\n---\n\n## Table of contents\n1. [Description](#1-description)\n2. [Business Requirements \u0026 Goals](#2-business-requirements--goals)\n3. [Reports, Dashboards \u0026 KPIs](#3-reports-dashboards--kpis)\n    - [Dashboard Example Output](#dashboard-example-output)\n4. [Data Warehouse Design, Tables \u0026 Sources](#4-data-warehouse-design-tables--sources)\u003cbr\u003e\n   4.1 [APIs and Data Sources](#41-apis-and-data-sources)\u003cbr\u003e\n   4.2 [ETL Process](#42-etl-process)\u003cbr\u003e\n   4.3 [Schemas](#43-schemas)\u003cbr\u003e\n      - [Raw Schema](#raw-schema)\u003cbr\u003e\n      - [Staging Schema](#staging-schema)\u003cbr\u003e\n      - [Trusted Schema](#trusted-schema)\u003cbr\u003e\n5. [Database Administration \u0026 Data Governance](#5-database-administration--data-governance)\n    - [5.1 Database Administration](#51-database-administration)\u003cbr\u003e\n    - [5.2 Data Governance](#52-data-governance)\u003cbr\u003e\n6. [Graphical User Interface (GUI)](#6-graphical-user-interface-gui)\n    - [GUI Example](#gui-example)\n---\n\n## 1. 
Description\n\nThis repository showcases an end-to-end data warehouse solution designed for a fictional health-tech startup.\u003cbr\u003e\nThe scenario centers on a mobile application that helps users monitor their health and wellness by tracking activity, sleep, and nutrition data, with a data engineer responsible for building and maintaining the pipelines that enable secure, reliable analysis of this information.\n\nThe project demonstrates how to:\n- Integrate and process data from multiple sources (including synthetic data and public APIs).\n- Implement robust ETL pipelines for cleaning, validating, and transforming raw data into analytics-ready tables.\n- Enforce strong data governance, privacy, and security practices, including the separation of PII and non-PII data.\n- Provide interactive dashboards and key performance indicators (KPIs) for user activity, sleep, nutrition, and goal achievement.\n- Apply role-based access control to protect sensitive data.\n\n\u003e **Note:**  \n\u003e This project is for educational and demonstration purposes only. The app and its data are entirely fictional and intended to showcase best practices in data engineering, warehousing, and analytics.\n\n---\n\n## 2. Business Requirements \u0026 Goals\n\n### Business Requirements \n- Integrate and process user health data (activity, sleep, nutrition) from multiple sources.\n- Ensure data privacy by separating and protecting PII and non-PII data.\n- Maintain high data quality and secure, role-based access.\n- Support analytics and reporting with trusted, well-structured data.\n\n### Core Business Goals \n- Provide actionable insights on user health and goal achievement.\n- Demonstrate best practices in data governance and lifecycle management.\n- Enable interactive dashboards and KPIs for key health metrics.\n\n---\n\n## 3. 
Reports, Dashboards \u0026 KPIs\n\n\u003e **Note:**  \n\u003e Before generating dashboards, you must first extract, transform, and load the data by running the following scripts in order:  \n\u003e 1. [`src/extract/healthapp.py`](src/extract/healthapp.py)  \n\u003e 2. [`src/transform/transform.py`](src/transform/transform.py)  \n\u003e 3. [`src/load/load.py`](src/load/load.py)\n\nInteractive dashboards can be generated by running the provided script: \n```bash\npython src/dashboard/dashboard.py\n```\nThis script ([`src/dashboard/dashboard.py`](src/dashboard/dashboard.py)) automatically creates and visualizes dashboards for your key health and fitness metrics using the [trusted views](sql/business_view/).\n\n**Dashboards and KPIs included:**\n\n- **Goal Achievement:**  \n  - % of users who have achieved their goals  \n    - [`trusted.vw_pct_users_achieved_goals`](sql/business_view/create_view_pct_users_achieved_goals.sql): Calculates the percentage of user goals (across all goal types) that have been achieved.\n\n- **Nutrition:**  \n  - User's favourite food and typical meal time  \n    - [`trusted.vw_user_favourite_food`](sql/business_view/create_view_user_favourite_food.sql): Shows each user's most frequently consumed food and the meal time they usually eat it.\n  - Average macronutrient distribution per user  \n    - [`trusted.vw_user_avg_macros`](sql/business_view/create_view_user_avg_macros.sql): Displays the average intake of calories, carbs, protein, and fat per user.\n\n- **Activity:**  \n  - Daily average calories burned per user  \n    - [`trusted.vw_user_daily_avg_calories_burned`](sql/business_view/create_view_avg_calories_burned.sql): Reports the average number of calories burned per user per day.\n\n- **All dashboards:**  \n  - Displayed together for easy comparison\n\n\u003cdetails\u003e\n\u003csummary id=\"dashboard-example-output\"\u003eDashboard Example Output\u003c/summary\u003e\n\n![Dashboard Example](assets/dashboard-example-screenshot.png)  
\n*Dashboard example shown for 10 generated users across 7 days*\n\n\u003c/details\u003e\n\n---\n\n## 4. Data Warehouse Design, Tables \u0026 Sources\n\n### 4.1 APIs and Data Sources\n\n- **HealthApp**: Synthetic data is generated using the [`healthapp.py`](src/extract/healthapp.py) script.  \n  - To generate raw data, run the following command:\n    ```bash\n    python src/extract/healthapp.py\n    ```\n  - This script creates the necessary schemas and tables in the `raw` schema (if they do not already exist) and populates them with raw user data (see [`raw.user_data`](sql/tables/raw/create_raw_user_data_table.sql)), for activity, sleep, nutrition, and goals.  \n- [**USDA API**](https://www.ers.usda.gov/developer/data-apis/): Nutritional data for food items is fetched dynamically via API calls within the [`healthapp.py`](src/extract/healthapp.py) script.  \n\n---\n\n## 4.2 ETL Process\n\n\u003cdetails\u003e\n\u003csummary\u003eShow ETL Pipeline Diagram\u003c/summary\u003e\n\n![ETL Pipeline](assets/etl-pipeline-diagram.png)  \n*Overview of the ETL pipeline: Extract, Transform, Load.*\n\n\u003c/details\u003e\n\u003cbr\u003e\nThe ETL pipeline consists of the following stages, each with a dedicated script:\n\n### 1. 
**Extract**\n- **Purpose:** Generate and collect raw data from synthetic sources and public APIs.\n- **Code:** [`src/extract/healthapp.py`](src/extract/healthapp.py)\n  - Generates synthetic user/activity/nutrition data and loads it into the `raw` schema.\n  - **Tables created in `raw` schema:**\n    - [`raw.user_data`](sql/tables/raw/create_raw_user_data_table.sql)\n    - [`raw.nutrition_log`](sql/tables/raw/create_raw_nutrition_log_table.sql)\n- **Supporting code:**\n  - [`src/extract/search_foods_api.py`](src/extract/search_foods_api.py): Fetches nutrition data from the USDA API.\n  - [`src/extract/load_data_from_csv.py`](src/extract/load_data_from_csv.py): Loads food and activity types from CSV files.\n- **How to run:**\n  ```sh\n  python src/extract/healthapp.py\n  ```\n\u003cdetails\u003e\n\u003csummary\u003eShow Sample Raw User Data\u003c/summary\u003e\n\n![Sample Raw Data](assets/sample-raw-user-data.png)  \n*Example of the `raw.user_data` table as generated by the extract script.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eShow Sample Raw Nutrition Log\u003c/summary\u003e\n\n![Sample Nutrition Log](assets/sample-raw-nutrition-log.png)  \n*Example of the `raw.nutrition_log` table as generated by the extract script.*\n\n\u003c/details\u003e\n\n---\n\n### 2. 
**Transform**\n- **Purpose:** Clean, validate, and transform raw data into analytics-ready staging tables.\n- **Code:** [`src/transform/transform.py`](src/transform/transform.py)\n  - Reads from the `raw` schema, processes the data, and loads it into the `staging` schema.\n  - **Tables created in `staging` schema:**\n    - [`staging.dim_user_profile`](sql/tables/staging/create_dim_user_profile_table.sql)\n    - [`staging.dim_food_item`](sql/tables/staging/create_dim_food_item_table.sql)\n    - [`staging.fact_activity_log`](sql/tables/staging/create_fact_activity_log_table.sql)\n    - [`staging.fact_sleep_log`](sql/tables/staging/create_fact_sleep_log_table.sql)\n    - [`staging.fact_nutrition_log`](sql/tables/staging/create_fact_nutrition_log_table.sql)\n    - [`staging.fact_goals_log`](sql/tables/staging/create_fact_goals_log_table.sql)\n- **How to run:**\n  ```sh\n  python src/transform/transform.py\n  ```\n\u003cdetails\u003e\n\u003csummary\u003eShow Sample Staging User Profile\u003c/summary\u003e\n\n![Sample Staging User Profile](assets/sample-staging-user-profle.png)  \n*Example of the `staging.dim_user_profile` table after transformation.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eShow Sample Staging Nutrition Log\u003c/summary\u003e\n\n![Sample Staging Nutrition Log](assets/sample-staging-nutrition-log.png)  \n*Example of the `staging.fact_nutrition_log` table after transformation.*\n\n\u003c/details\u003e\n\n---\n\n### 3. 
**Load**\n- **Purpose:** Load fully processed data from staging into the trusted schema for analytics and reporting.\n- **Code:** [`src/load/load.py`](src/load/load.py)\n  - Moves data from the `staging` schema into the `trusted` schema, enforcing all business and data quality rules.\n  - **Tables created in `trusted` schema:**\n    - [`trusted.nutrition_data`](sql/tables/trusted/create_trusted_nutrition_data_table.sql)\n    - [`trusted.sleep_data`](sql/tables/trusted/create_trusted_sleep_data_table.sql)\n    - [`trusted.activity_data`](sql/tables/trusted/create_trusted_activity_data_table.sql)\n    - [`trusted.goals_data`](sql/tables/trusted/create_trusted_goals_data_table.sql)\n- **How to run:**\n  ```sh\n  python src/load/load.py\n  ```\n\n\u003e **Note:**  \n\u003e Each script will automatically create the required tables in its schema if they do not already exist.\n\n\u003cdetails\u003e\n\u003csummary\u003eShow Sample Trusted Nutrition\u003c/summary\u003e\n\n![Sample Trusted Nutrition Data](assets/sample-trusted-nutrition-data.png)  \n*Example of the `trusted.nutrition_data` table after loading.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eShow Sample Trusted Goals\u003c/summary\u003e\n\n![Sample Trusted Goals Data](assets/sample-trusted-goals-data.png)  \n*Example of the `trusted.goals_data` table after loading.*\n\n\u003c/details\u003e\n\n---\n\n### 4.3 Schemas\n\nThe data warehouse is organized into three schemas: **raw**, **staging**, and **trusted**. Each schema serves a specific purpose in the ETL pipeline:\n\n#### Raw Schema\n- **Purpose**: The initial storage for raw, unprocessed data directly extracted from the sources.  \n\n#### `raw.user_data`\n| Column Name       | Data Type     | Description                                      |\n|-------------------|--------------|--------------------------------------------------|\n| `record_id`       | `SERIAL PRIMARY KEY`  | Unique identifier for each record.      
|\n| `user_id`         | `INTEGER`    | Unique identifier for the user.                 |\n| `name`            | `VARCHAR(100)` | User's name.                                   |\n| `age`             | `INTEGER`    | User's age.                                     |\n| `weight_kg`       | `NUMERIC(4,1)` | User's weight in kilograms.                    |\n| `height_cm`       | `NUMERIC(4,1)` | User's height in centimeters.                  |\n| `gender`          | `VARCHAR(10)` | User's gender.                                  |\n| `calorie_goal`    | `INTEGER`    | Daily calorie goal for the user.                |\n| `macro_goal`      | `JSON`       | JSON object containing macro goals (carbs, protein, fat). |\n| `activity_start`  | `TIMESTAMP`  | Start time of the activity.                     |\n| `activity_type`   | `VARCHAR(50)` | Type of activity (e.g., walking, running).      |\n| `steps`           | `INTEGER`    | Daily step count.                               |\n| `heart_rate`      | `INTEGER`    | Heart rate during the activity.                 |\n| `calories_burned` | `INTEGER`    | Calories burned during the day.                 |\n| `sleep_start`     | `TIMESTAMP`  | Start time of sleep.                            |\n| `sleep_end`       | `TIMESTAMP`  | End time of sleep.                              |\n| `sleep_quality_score` | `INTEGER` | Quality score of sleep.                         |\n| `goal_type`       | `VARCHAR(50)` | Type of goal (e.g., calories burned, steps taken). |\n| `goal_target`     | `INTEGER`    | Target value for the goal.                      |\n| `created_at`      | `TIMESTAMP`  | Timestamp when the record was created.          |\n\n#### `raw.nutrition_log`\n| Column Name       | Data Type     | Description                                      |\n|-------------------|--------------|--------------------------------------------------|\n| `nutrition_id`    | `SERIAL PRIMARY KEY` | Unique identifier for the nutrition record.  
|\n| `user_id`         | `INTEGER`     | Unique identifier for the user.                 |\n| `date`            | `DATE`       | Date of the nutrition log.                      |\n| `food_item`       | `VARCHAR(255)` | Name of the food item.                         |\n| `meal_type`       | `VARCHAR(100)` | Type of meal (e.g., breakfast, lunch).         |\n| `calories_per_100g` | `INTEGER` | Calories per 100 grams of the food item.     |\n| `carbs_per_100g`  | `INTEGER` | Carbohydrates per 100 grams of the food item. |\n| `protein_per_100g` | `INTEGER` | Protein per 100 grams of the food item.       |\n| `fat_per_100g`    | `INTEGER` | Fat per 100 grams of the food item.           |\n\n---\n\n#### Staging Schema\n- **Purpose**: Stores cleaned and transformed data, ready for further processing.  \n\u003cdetails\u003e\n\u003csummary\u003eShow Staging Schema Star Diagram\u003c/summary\u003e\n\n![Staging Star Diagram](assets/staging-star-diagram-mermaid.png)  \n*Star diagram for the staging schema, illustrating fact and dimension tables and their relationships.*\n\n\u003c/details\u003e\n\n#### `staging.dim_user_profile`\n| Column Name       | Data Type     | Description                                      |\n|-------------------|--------------|--------------------------------------------------|\n| `user_id`         | `BIGINT PRIMARY KEY`     | Unique identifier for the user.      |\n| `name`            | `VARCHAR(255)` | User's name.                                   |\n| `age`             | `INTEGER`    | User's age.                                     |\n| `weight_kg`       | `DECIMAL(4,1)` | User's weight in kilograms.                    |\n| `height_cm`       | `DECIMAL(4,1)` | User's height in centimeters.                  |\n| `gender`          | `VARCHAR(50)` | User's gender.                                  |\n| `calorie_goal`    | `INTEGER`    | Daily calorie goal for the user.                
|\n| `carbs_goal`      | `INTEGER`    | Daily carbohydrate goal for the user.           |\n| `protein_goal`    | `INTEGER`    | Daily protein goal for the user.                |\n| `fat_goal`        | `INTEGER`    | Daily fat goal for the user.                    |\n\n#### `staging.dim_food_item`\n| Column Name       | Data Type     | Description                                      |\n|-------------------|--------------|--------------------------------------------------|\n| `food_item_id`    | `BIGINT PRIMARY KEY`     | Unique identifier for the food item.        |\n| `food_item`       | `VARCHAR(255)` | Name of the food item.                         |\n| `calories_per_100g` | `DECIMAL(4,0)` | Calories per 100 grams of the food item.     |\n| `carbs_per_100g`  | `DECIMAL(3,0)` | Carbohydrates per 100 grams of the food item. |\n| `protein_per_100g` | `DECIMAL(3,0)` | Protein per 100 grams of the food item.       |\n| `fat_per_100g`    | `DECIMAL(3,0)` | Fat per 100 grams of the food item.           |\n\n#### `staging.fact_activity_log`\n| Column Name       | Data Type     | Description                                      |\n|-------------------|--------------|--------------------------------------------------|\n| `activity_id`     | `BIGINT PRIMARY KEY`     | Unique identifier for the activity record. |\n| `user_id`         | `BIGINT`     | Unique identifier for the user.                 |\n| `timestamp`       | `TIMESTAMP`  | Timestamp of the activity.                      |\n| `activity_type`   | `VARCHAR(100)` | Type of activity (e.g., walking, running).     |\n| `steps`           | `INTEGER`    | Number of steps taken during the activity.      |\n| `heart_rate`      | `INTEGER`    | Heart rate during the activity.                 |\n| `calories_burned` | `INTEGER`    | Calories burned during the activity.            
|\n\n#### `staging.fact_sleep_log`\n| Column Name       | Data Type     | Description                                      |\n|-------------------|--------------|--------------------------------------------------|\n| `sleep_id`        | `BIGINT PRIMARY KEY`     | Unique identifier for the sleep record. |\n| `user_id`         | `BIGINT`     | Unique identifier for the user.                 |\n| `date`            | `DATE`       | Date of the sleep record.                       |\n| `sleep_start`     | `TIMESTAMP`  | Start time of sleep.                            |\n| `sleep_end`       | `TIMESTAMP`  | End time of sleep.                              |\n| `sleep_duration_hours` | `DECIMAL(5,1)` | Duration of sleep in hours.                  |\n| `sleep_quality_score` | `INTEGER` | Quality score of sleep.                         |\n\n#### `staging.fact_nutrition_log`\n| Column Name       | Data Type     | Description                                      |\n|-------------------|--------------|--------------------------------------------------|\n| `nutrition_id`    | `BIGINT PRIMARY KEY`     | Unique identifier for the nutrition record. |\n| `user_id`         | `BIGINT`     | Unique identifier for the user.                 |\n| `date`            | `DATE`       | Date of the nutrition log.                      |\n| `food_item_id`    | `BIGINT`     | Foreign key to the food item dimension.         |\n| `meal_type`       | `VARCHAR(100)` | Type of meal (e.g., breakfast, lunch).         |\n\n#### `staging.fact_goals_log`\n| Column Name       | Data Type     | Description                                      |\n|-------------------|--------------|--------------------------------------------------|\n| `goal_id`         | `BIGINT PRIMARY KEY`     | Unique identifier for the goal record.|\n| `user_id`         | `BIGINT`     | Unique identifier for the user.                 |\n| `date`            | `DATE`       | Date of the goal record.                        
|\n| `goal_type`       | `VARCHAR(100)` | Type of goal (e.g., calories burned, steps taken). |\n| `target_value`    | `INTEGER`    | Target value for the goal.                      |\n| `actual_value`    | `INTEGER`    | Actual value achieved for the goal.             |\n| `status`          | `VARCHAR(50)` | Status of the goal (e.g., achieved, not achieved). |\n\n--- \n\n#### Trusted Schema\n- **Purpose**: Stores the final, fully processed data that is ready for analytics and reporting.  \n\u003cdetails\u003e\n\u003csummary\u003eShow Trusted Schema Star Diagram\u003c/summary\u003e\n\n![Trusted Star Diagram](assets/trusted-star-diagram-mermaid.png)  \n*Star diagram for the trusted schema, illustrating the relationships between tables.*\n\n\u003c/details\u003e\n\n#### `trusted.nutrition_data`\n| Column Name         | Data Type        | Description                                      |\n|---------------------|-----------------|--------------------------------------------------|\n| `nutrition_id`      | `BIGINT PRIMARY KEY` | Unique identifier for the nutrition record. |\n| `user_id`           | `BIGINT`        | Unique identifier for the user (foreign key to PII table). |\n| `date`              | `DATE`          | Date of the nutrition log.                       |\n| `food_item`         | `VARCHAR(255)`  | Name of the food item.                           |\n| `meal_type`         | `VARCHAR(100)`  | Type of meal (e.g., breakfast, lunch).           |\n| `calories_per_100g` | `DECIMAL(4,0)`  | Calories per 100 grams of the food item.         |\n| `carbs_per_100g`    | `DECIMAL(3,0)`  | Carbohydrates per 100 grams of the food item.    |\n| `protein_per_100g`  | `DECIMAL(3,0)`  | Protein per 100 grams of the food item.          |\n| `fat_per_100g`      | `DECIMAL(3,0)`  | Fat per 100 grams of the food item.              
|\n\n#### `trusted.activity_data`\n| Column Name         | Data Type        | Description                                      |\n|---------------------|-----------------|--------------------------------------------------|\n| `activity_id`       | `BIGINT PRIMARY KEY` | Unique identifier for the activity record.   |\n| `user_id`           | `BIGINT`        | Unique identifier for the user (foreign key to PII table). |\n| `timestamp`         | `TIMESTAMP`     | Timestamp of the activity.                       |\n| `activity_type`     | `VARCHAR(100)`  | Type of activity (e.g., walking, running).       |\n| `steps`             | `INT`           | Number of steps taken during the activity.       |\n| `heart_rate`        | `INT`           | Heart rate during the activity.                  |\n| `calories_burned`   | `INT`           | Calories burned during the activity.             |\n\n#### `trusted.sleep_data`\n| Column Name             | Data Type        | Description                                      |\n|-------------------------|-----------------|--------------------------------------------------|\n| `sleep_id`              | `BIGINT PRIMARY KEY` | Unique identifier for the sleep record.      |\n| `user_id`               | `BIGINT`        | Unique identifier for the user (foreign key to PII table). |\n| `date`                  | `DATE`          | Date of the sleep record.                        |\n| `sleep_start`           | `TIMESTAMP`     | Start time of sleep.                             |\n| `sleep_end`             | `TIMESTAMP`     | End time of sleep.                               |\n| `sleep_duration_hours`  | `DECIMAL(5,1)`  | Duration of sleep in hours.                      |\n| `sleep_quality_score`   | `INTEGER`       | Quality score of sleep.                          
|\n\n#### `trusted.goals_data`\n| Column Name         | Data Type        | Description                                      |\n|---------------------|-----------------|--------------------------------------------------|\n| `goal_id`           | `BIGINT PRIMARY KEY` | Unique identifier for the goal record.        |\n| `user_id`           | `BIGINT`        | Unique identifier for the user (foreign key to PII table). |\n| `date`              | `DATE`          | Date of the goal record.                         |\n| `goal_type`         | `VARCHAR(100)`  | Type of goal (e.g., calories burned, steps).     |\n| `target_value`      | `INT`           | Target value for the goal.                       |\n| `actual_value`      | `INT`           | Actual value achieved for the goal.              |\n| `status`            | `VARCHAR(50)`   | Status of the goal (e.g., achieved, not achieved, pending). |\n\n#### `trusted.user_profile`\n| Column Name         | Data Type        | Description                                      |\n|---------------------|-----------------|--------------------------------------------------|\n| `user_id`           | `BIGINT PRIMARY KEY` | Unique identifier for the user.              |\n| `name`              | `VARCHAR(255)`  | User's name (PII, access restricted).            |\n| `age`               | `INT`           | User's age (PII, access restricted).             |\n| `gender`            | `VARCHAR(50)`   | User's gender (PII, access restricted).          |\n\n\u003e **Note:**  \n\u003e All analytics and reporting are performed using the trusted fact tables, which only reference users by `user_id`.  \n\u003e PII is only accessible to authorized roles.\n\n--- \n\n## 5. 
Database Administration \u0026 Data Governance\n\n### 5.1 Database Administration\n\n- **Create DBA Roles:**  \n  - Use the scripts in [`sql/roles/`](sql/roles/) to create the `dba_role` and grant access to the `trusted` schema and its views:\n    - [`grant_usage_trusted_schema.sql`](sql/roles/grant_usage_trusted_schema.sql)\n    - [`grant_select_trusted_views.sql`](sql/roles/grant_select_trusted_views.sql)\n    - [`grant_access_pii_data.sql`](sql/roles/grant_access_pii_data.sql) (restricts PII access to DBAs)\n    - [`grant_access_non_pii_data.sql`](sql/roles/grant_access_non_pii_data.sql) (grants analytics access to non-PII data)\n    - [`grant_access_archive_user_data_pii.sql`](sql/roles/grant_access_archive_user_data_pii.sql) (restricts archive access to DBAs)\n- **Optimize DB Performance:**  \n  - Use [`explain_query_execution.sql`](sql/roles/explain_query_execution.sql) to analyze and optimize query performance.\n- **Schema Organization:**  \n  - All scripts for table creation, data insertion, archiving, and deletion are organized by data sensitivity in [`sql/data/`](sql/data/).\n\n---\n\n### 5.2 Data Governance\n\n- **Data Privacy \u0026 PII Handling:**\n  - PII data (e.g., user names, ages) is stored only in `trusted.user_data_pii` and managed via scripts in [`sql/data/pii/`](sql/data/pii/).\n  - Non-PII data is stored in `trusted.user_data_non_pii` (see [`sql/data/non_pii/`](sql/data/non_pii/)), where user identifiers are hashed and ages are grouped.\n  - Access to PII tables is strictly limited to users with the `dba_role`.\n- **Data Lifecycle Management:**\n  - Old or inactive records are archived using scripts in [`sql/data/archive/`](sql/data/archive/).\n  - Archived records are deleted from the main tables only after successful archival, ensuring no data loss.\n- **Data Quality Assurance:**\n  - All trusted tables enforce strong data quality constraints (`NOT NULL`, `CHECK`, valid value lists) at the schema level.\n  - 
ETL scripts filter out incomplete or invalid records before loading into trusted tables.\n- **Security \u0026 Access Control:**\n  - Role-based access is enforced at the schema and table level.\n  - PII data is never exposed to analytics or reporting users.\n- **Documentation:**\n  - All scripts and policies are documented in this README for transparency and auditability.\n\n---\n\n## 6. Graphical User Interface (GUI)\n\nA simple GUI is included to make running the ETL pipeline and generating dashboards more user-friendly.\n\n### How it Works\n\n- The GUI is built with [Tkinter](https://docs.python.org/3/library/tkinter.html).\n- Launch it with:\n  ```sh\n  python src/interface/main.py\n  ```\n- The interface will prompt you to enter:\n  - Number of Users\n  - Number of Days\n  - Database Host, Port, User, Password, and Name\n  - USDA API Key\n\nWhen you click **\"Run Pipeline\"**, the GUI passes your input as command-line arguments and environment variables to the ETL scripts. The scripts are then run in sequence: extract, transform, load, and dashboard generation.\n\n**File location:**  \n[`src/interface/main.py`](src/interface/main.py)\n\n**Important:**  \n- If you use the GUI, your input for number of users and days will be used for that run.\n- If you run the extract script directly (e.g., `python src/extract/healthapp.py`), it will use the default global variables (`NO_USERS` and `NO_DAYS`) defined in the script.\n\u003e **Note:**  \n\u003e For larger numbers of users or days, the pipeline will take longer to complete. 
\u003cbr\u003e\n\u003e This is because the USDA API is called to retrieve food information for every user and day, which can be time-consuming.\n\n**Example GUI window:**  \n\u003cdetails\u003e\n\u003csummary id=\"gui-example\"\u003eShow Example GUI window\u003c/summary\u003e\n\n![GUI Example](assets/gui-example.png)  \n*The GUI for running the ETL pipeline and dashboards with custom parameters.*\n\n\u003c/details\u003e\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmelinteflxrin%2Fsoftserve-bigdata-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmelinteflxrin%2Fsoftserve-bigdata-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmelinteflxrin%2Fsoftserve-bigdata-project/lists"}