https://github.com/lioccoumd/etl-analysis
This project automates ETL for gym exercise data, predicting safety scores using KNN and optimizing with GridSearchCV. It generates recommendations, statistical summaries, and visualizations to improve gym safety and client retention. Logging ensures transparency.
https://github.com/lioccoumd/etl-analysis
automation big-data business-problem dataset etl etl-automation etl-pipeline etl-process logging python solo-project visualization
Last synced: 4 months ago
JSON representation
This project automates ETL for gym exercise data, predicting safety scores using KNN and optimizing with GridSearchCV. It generates recommendations, statistical summaries, and visualizations to improve gym safety and client retention. Logging ensures transparency.
- Host: GitHub
- URL: https://github.com/lioccoumd/etl-analysis
- Owner: LIoccoUMD
- Created: 2024-07-28T15:42:56.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-02-25T17:43:14.000Z (over 1 year ago)
- Last Synced: 2025-04-06T22:12:38.783Z (about 1 year ago)
- Topics: automation, big-data, business-problem, dataset, etl, etl-automation, etl-pipeline, etl-process, logging, python, solo-project, visualization
- Language: Python
- Homepage:
- Size: 1.72 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Gym Safety ETL and Analysis
## Project Overview
This project involves extracting, transforming, and analyzing a dataset of gym exercises. The analysis includes calculating safety scores for exercises, evaluating models, and generating visualizations. The process is automated through a series of Python scripts, with enhanced interactivity
and logging for better usability and debugging.
### Business Problem
Unsafe lifting practices in gyms pose risks to member safety, which can be mitigated by providing easily accessible, data-driven visualizations of proper exercise techniques categorized by difficulty and muscle group. The project aims to develop a recommendation system that
enhances client satisfaction and reduces injury rates, directly contributing to higher retention and client loyalty.
### Data Sets Used
- `megaGymDataset.csv`: Contains data on various exercises, including type, body part, equipment, difficulty level, rating, and description.
- `dataset-metadata.json`: Metadata for the datasets.
### Techniques Employed
- **Mean Imputation**: Handling missing values by imputing the mean rating for each exercise level.
- **Encoding Categorical Variables**
- **K-Nearest Neighbor to predict safety scores**
- GridSearchCV to find the optimal number of neighbors
- MSE, MAE, R^2 metrics
### Expected Outputs
- Analysis/evaluation/visualizations in clear, readable files
- Summary statistics of the exercise dataset.
- Visualization of exercises grouped into clusters
- Recommendations of exercises based on their difficulty.
- Logging to all files
- Non-technical visualiztion for the user
- Technical visulation to represent model performance
# Setup Instructions
## Setting Up Kaggle API Keys
To run this project, you may need access to datasets hosted on Kaggle. Follow the steps below to set up your Kaggle API keys:
1. **Obtain Your Kaggle API Key:**
- Log in to your Kaggle account.
- Go to your account settings by clicking on your profile picture in the top right corner and selecting "Account."
- Scroll down to the "API" section and click "Create New API Token."
- A file named `kaggle.json` will be downloaded, containing your Kaggle API credentials.
2. **Place the API Key:**
- Move the `kaggle.json` file to a secure location:
- **Windows:** `C:\Users\\.kaggle\kaggle.json`
- Ensure that the `.kaggle` directory is hidden and that the `kaggle.json` file is accessible only by you.
3. **Using the API Key in This Project:**
- The Kaggle API is required to download datasets automatically when you run the scripts.
- Ensure you have the Kaggle Python package installed:
```sh
python -m pip install kaggle
```
- Authenticate your Kaggle API in your scripts:
```python
import kaggle
kaggle.api.authenticate()
```
- The datasets will be automatically downloaded using the API when you run the project.
## Cloning the Repository
Clone the repository to your local machine using the following command:
`git clone https://github.com/username/inst414-final-project-luciano-iocco.git`
Create a virtual environment and select the most recent version of Python. The current working Python version is 3.11.1. requirements.txt contains all of the dependencies needed to run this project. Install the required packages using `python -m pip install -r requirements.txt`
Run the main script to execute the ETL process and analysis (effectively run the entire program) `python main.py`
## Logging
Logging is configured to write to "gym_project.log". The log includes detailed information about each step of the process, including any errors that occur along with
their time, level, and a message.
# Code Package Structure
### **data/**
- **`downloaded/`**: Stores raw datasets fetched from sources.
- **`processed/`**: Holds processed data files after transformation.
### **outputs/**
- **`descriptive_analysis.csv`**: Results from the descriptive analysis script.
- **`prescriptive_analysis.csv`**: Results from the prescriptive analysis script.
### **analysis/**
- **`descriptive_analysis.py`**: Conducts descriptive statistical analysis.
- **`prescriptive_analysis.py`**: Evaluates models and generates recommendations
### **etl/**
- **`extract.py`**: Loads raw dataset into a DataFrame.
- **`transform.py`**: Processes data, manages missing values, and computes exercise safety scores.
### **vis/**
- **`visualizations.py`**: Creates visualizations for data insights and results.
### **log/**
- Automatically stores logging output.
- **`main.py`** outputs logs to `gym_project.log`.