{"id":26514713,"url":"https://github.com/lioccoumd/etl-analysis","last_synced_at":"2026-02-17T18:36:56.015Z","repository":{"id":279466789,"uuid":"834864877","full_name":"LIoccoUMD/ETL-Analysis","owner":"LIoccoUMD","description":"This project automates ETL for gym exercise data, predicting safety scores using KNN and optimizing with GridSearchCV. It generates recommendations, statistical summaries, and visualizations to improve gym safety and client retention. Logging ensures transparency.","archived":false,"fork":false,"pushed_at":"2025-02-25T17:43:14.000Z","size":1808,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-06T22:12:38.783Z","etag":null,"topics":["automation","big-data","business-problem","dataset","etl","etl-automation","etl-pipeline","etl-process","logging","python","solo-project","visualization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LIoccoUMD.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-28T15:42:56.000Z","updated_at":"2025-02-25T17:43:18.000Z","dependencies_parsed_at":"2025-02-25T18:48:54.585Z","dependency_job_id":null,"html_url":"https://github.com/LIoccoUMD/ETL-Analysis","commit_stats":null,"previous_names":["lioccoumd/etl-analysis"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/LIoccoUMD/ETL-Analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LIoccoUMD%2FETL-Analysis","tags_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/LIoccoUMD%2FETL-Analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LIoccoUMD%2FETL-Analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LIoccoUMD%2FETL-Analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LIoccoUMD","download_url":"https://codeload.github.com/LIoccoUMD/ETL-Analysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LIoccoUMD%2FETL-Analysis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29552799,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-17T18:16:07.221Z","status":"ssl_error","status_checked_at":"2026-02-17T18:16:04.782Z","response_time":100,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","big-data","business-problem","dataset","etl","etl-automation","etl-pipeline","etl-process","logging","python","solo-project","visualization"],"created_at":"2025-03-21T05:29:10.402Z","updated_at":"2026-02-17T18:36:55.982Z","avatar_url":"https://github.com/LIoccoUMD.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Gym Safety ETL and Analysis\n\n## Project Overview\n\nThis project involves extracting, transforming, and 
analyzing a dataset of gym exercises. The analysis includes calculating safety scores for exercises, evaluating models, and generating visualizations. The process is automated through a series of Python scripts, with enhanced interactivity\nand logging for better usability and debugging.\n\n### Business Problem\nUnsafe lifting practices in gyms pose risks to member safety, which can be mitigated by providing easily accessible, data-driven visualizations of proper exercise techniques categorized by difficulty and muscle group. The project aims to develop a recommendation system that\nenhances client satisfaction and reduces injury rates, directly contributing to higher retention and client loyalty.\n\n### Data Sets Used\n- `megaGymDataset.csv`: Contains data on various exercises, including type, body part, equipment, difficulty level, rating, and description.\n- `dataset-metadata.json`: Metadata for the datasets.\n\n### Techniques Employed\n- **Mean Imputation**: Handles missing values by imputing the mean rating for each exercise level.  \n- **Encoding Categorical Variables**  \n- **K-Nearest Neighbors to predict safety scores**  \n    - GridSearchCV to find the optimal number of neighbors\n    - MSE, MAE, R^2 metrics\n\n### Expected Outputs\n- Analysis/evaluation/visualizations in clear, readable files\n- Summary statistics of the exercise dataset.\n- Visualization of exercises grouped into clusters\n- Recommendations of exercises based on their difficulty.\n- Logging to all files\n- Non-technical visualization for the user\n- Technical visualization to represent model performance\n\n# Setup Instructions\n\n## Setting Up Kaggle API Keys\n\nTo run this project, you may need access to datasets hosted on Kaggle. Follow the steps below to set up your Kaggle API keys:\n\n1. 
**Obtain Your Kaggle API Key:**\n    - Log in to your Kaggle account.\n    - Go to your account settings by clicking on your profile picture in the top right corner and selecting \"Account.\"\n    - Scroll down to the \"API\" section and click \"Create New API Token.\"\n    - A file named `kaggle.json` will be downloaded, containing your Kaggle API credentials.\n\n2. **Place the API Key:**\n    - Move the `kaggle.json` file to a secure location:\n        - **Windows:** `C:\\Users\\\u003cYourUsername\u003e\\.kaggle\\kaggle.json`\n    - Ensure that the `.kaggle` directory is hidden and that the `kaggle.json` file is accessible only by you.\n\n3. **Using the API Key in This Project:**\n    - The Kaggle API is required to download datasets automatically when you run the scripts.\n    - Ensure you have the Kaggle Python package installed:\n      ```sh\n      python -m pip install kaggle\n      ```\n    - Authenticate your Kaggle API in your scripts:\n      ```python\n      import kaggle\n      kaggle.api.authenticate()\n      ```\n    - The datasets will be automatically downloaded using the API when you run the project.\n\n\n\n## Cloning the Repository\nClone the repository to your local machine using the following command:  \n`git clone https://github.com/username/inst414-final-project-luciano-iocco.git`  \nCreate a virtual environment with a recent version of Python; the project currently works with Python 3.11.1. `requirements.txt` lists all of the dependencies needed to run this project. Install the required packages using `python -m pip install -r requirements.txt`  \nRun the main script to execute the ETL process and analysis (effectively, the entire program): `python main.py`\n\n## Logging\n\nLogging is configured to write to \"gym_project.log\". 
The log includes detailed information about each step of the process, including any errors that occur, along with each entry's time, level, and message.\n\n# Code Package Structure\n\n### **data/**\n- **`downloaded/`**: Stores raw datasets fetched from sources.\n- **`processed/`**: Holds processed data files after transformation.\n\n### **outputs/**\n- **`descriptive_analysis.csv`**: Results from the descriptive analysis script.\n- **`prescriptive_analysis.csv`**: Results from the prescriptive analysis script.\n\n### **analysis/**\n- **`descriptive_analysis.py`**: Conducts descriptive statistical analysis.\n- **`prescriptive_analysis.py`**: Evaluates models and generates recommendations.\n\n### **etl/**\n- **`extract.py`**: Loads raw dataset into a DataFrame.\n- **`transform.py`**: Processes data, manages missing values, and computes exercise safety scores.\n\n### **vis/**\n- **`visualizations.py`**: Creates visualizations for data insights and results.\n\n### **log/**\n- Automatically stores logging output.\n- **`main.py`** outputs logs to `gym_project.log`.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flioccoumd%2Fetl-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flioccoumd%2Fetl-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flioccoumd%2Fetl-analysis/lists"}