https://github.com/ahmed122000/ml_model_deployment
The HR Analytics: Job Change Predictor is a Flask-based web application that uses machine learning to predict whether an employee will stay with a company or leave. It allows users to train models, evaluate their performance, and make predictions based on employee data, providing valuable insights for HR decision-making.
classification flask machine-learning python3 rest-api scikit-learn
- Host: GitHub
- URL: https://github.com/ahmed122000/ml_model_deployment
- Owner: Ahmed122000
- Created: 2022-01-10T15:49:29.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2024-12-25T20:52:53.000Z (over 1 year ago)
- Last Synced: 2025-03-28T17:21:18.497Z (about 1 year ago)
- Topics: classification, flask, machine-learning, python3, rest-api, scikit-learn
- Language: HTML
- Homepage:
- Size: 626 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# ML Model Deployment - HR Analytics Job Change Predictor
> A production-ready Flask web application that predicts whether a data scientist will stay with a company or leave. Features machine learning model training, evaluation, and interactive predictions with data balancing techniques.
---
## Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Tech Stack](#tech-stack)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Usage](#usage)
- [Machine Learning Models](#machine-learning-models)
- [Dataset](#dataset)
- [Results & Performance](#results--performance)
- [API Endpoints](#api-endpoints)
- [Deployment](#deployment)
- [License](#license)
---
## Overview
This project builds a predictive model to determine whether data scientists will remain with their current employer or leave for better opportunities. The application provides:
- **Multiple ML algorithms** compared side by side
- **Data balancing** techniques (oversampling, undersampling)
- **Interactive training interface** for experimentation
- **Real-time predictions** on new employee data
- **Detailed evaluation metrics** and classification reports
**Business Value**: HR departments can identify at-risk employees and implement retention strategies.
---
## Features
### Machine Learning Capabilities
| Feature | Description |
|---------|-------------|
| **Multiple Algorithms** | Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM) |
| **Data Balancing** | Handle imbalanced classes with oversampling, undersampling, SMOTE |
| **Cross-Validation** | K-Fold validation for robust model evaluation |
| **Hyperparameter Tuning** | GridSearchCV for optimal parameters |
| **Model Persistence** | Save/load trained models with joblib |
| **Feature Scaling** | StandardScaler for optimal algorithm performance |
### Analysis & Reporting
| Feature | Description |
|---------|-------------|
| **Classification Metrics** | Precision, Recall, F1-Score, Accuracy, AUC-ROC |
| **Confusion Matrix** | Confusion matrix visualization |
| **Train/Test Scores** | Detailed performance on training and test sets |
| **Classification Report** | Per-class precision, recall, F1-score |
| **Feature Importance** | Identify most influential features |
| **ROC Curves** | Receiver Operating Characteristic analysis |
### Prediction Features
| Feature | Description |
|---------|-------------|
| **Batch Predictions** | Predict on multiple employees at once |
| **Confidence Scores** | Probability of staying vs leaving |
| **Feature-wise Explanation** | Understand prediction reasoning |
| **Historical Comparisons** | Track prediction accuracy over time |
### User Interface
| Feature | Description |
|---------|-------------|
| **Interactive Dashboard** | Real-time model performance visualization |
| **Model Comparison** | Compare different algorithms side-by-side |
| **Training History** | Track all trained models and their metrics |
| **Download Reports** | Export predictions and analysis as CSV/PDF |
---
## Tech Stack
| Component | Technology |
|-----------|-----------|
| **Backend** | Python 3.8+, Flask 2.0 |
| **ML Libraries** | scikit-learn, XGBoost, LightGBM |
| **Data Processing** | Pandas, NumPy |
| **Visualization** | Matplotlib, Seaborn, Plotly |
| **Model Storage** | joblib |
| **Frontend** | HTML5, CSS3, JavaScript, Bootstrap |
| **Deployment** | Gunicorn, Docker |
---
## Project Structure
```plaintext
ML_model_deployment/
├── main.py                      # Flask application entry point
├── train.py                     # Model training logic
├── predict.py                   # Prediction logic
├── evaluate.py                  # Model evaluation
├── data_processor.py            # Data loading & preprocessing
├── config.py                    # Configuration settings
│
├── requirements.txt             # Python dependencies
├── Dockerfile                   # Container configuration
├── docker-compose.yml           # Multi-container setup
│
├── data/                        # Training datasets
│   ├── normal_data.csv          # Original (imbalanced) data
│   ├── oversample.csv           # Oversampled data
│   └── undersample_data.csv     # Undersampled data
│
├── models/                      # Saved trained models
│   ├── lr_model.pkl             # Logistic Regression
│   ├── knn_model.pkl            # KNN model
│   ├── svm_model.pkl            # SVM model
│   └── scalers/                 # Feature scalers
│
├── templates/                   # HTML templates
│   ├── base.html                # Base template
│   ├── index.html               # Home page
│   ├── train.html               # Training interface
│   ├── predict.html             # Prediction interface
│   ├── results.html             # Results display
│   └── dashboard.html           # Analytics dashboard
│
├── static/                      # Static files
│   ├── css/
│   │   ├── style.css            # Custom styling
│   │   └── bootstrap.min.css
│   ├── js/
│   │   ├── script.js            # Client-side logic
│   │   └── charts.js            # Chart generation
│   └── images/                  # UI images
│
└── README.md                    # This file
```
---
## Installation
### Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
- Virtual environment (recommended)
- 2GB RAM minimum
### Step-by-Step Setup
1. **Clone repository**:
```bash
git clone https://github.com/Ahmed122000/ML_model_deployment.git
cd ML_model_deployment
```
2. **Create virtual environment**:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. **Install dependencies**:
```bash
pip install -r requirements.txt
```
4. **Prepare datasets**:
```bash
# Ensure these files exist in data/ directory:
# - normal_data.csv
# - oversample.csv
# - undersample_data.csv
```
5. **Run application**:
```bash
python main.py
```
6. **Access application**:
```
http://localhost:5000
```
---
## Usage
### Web Interface Navigation
#### 1. Home Page
- Project overview
- Quick links to train/predict
- Model statistics
#### 2. Training Models
**Steps**:
1. Navigate to "Train Models" tab
2. **Select Dataset**:
- Normal (original data)
- Oversampled (more minority class samples)
- Undersampled (fewer majority class samples)
3. **Choose Algorithm**:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
4. **Optional**: Adjust hyperparameters
5. Click "Train Model"
6. View results:
- Train/Test scores
- Classification report
- Confusion matrix
- Feature importance
**Training Output**:
```
Model: Logistic Regression
Dataset: Oversampled
Train Score: 0.8245
Test Score: 0.7893
Precision: 0.8102
Recall: 0.7654
F1-Score: 0.7873
```
#### 3. Making Predictions
**Steps**:
1. Navigate to "Predict" tab
2. Fill employee information:
- City Development Index (0.0 - 1.0)
- Gender (M/F)
- Relevant Experience (Yes/No)
- Enrolled in University (Full-time/Part-time/No)
- Education Level (High School/Bachelor/Master/PhD)
- Major Discipline
- Experience (years)
- Company Size
- Company Type
- Last New Job (years)
- Training Hours
3. Click "Predict"
4. View prediction result:
- **Will Stay** or **Will Leave**
- Confidence percentage
- Feature contributions
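
Before a submitted form reaches the model, its fields need basic type and range checks. The sketch below shows one way to do that for a few of the fields listed above. The helper name `validate_employee` and its error messages are hypothetical and are not taken from the repository; only the field names and ranges come from this README.

```python
from typing import Any, Dict, List

# Allowed values follow the form description above; the helper itself is hypothetical.
YES_NO = {"Yes", "No"}


def validate_employee(form: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    errors = []

    cdi = form.get("city_development_index")
    if not isinstance(cdi, (int, float)) or not 0.0 <= cdi <= 1.0:
        errors.append("city_development_index must be a number between 0.0 and 1.0")

    if form.get("relevant_experience") not in YES_NO:
        errors.append("relevant_experience must be 'Yes' or 'No'")

    for field in ("experience", "training_hours"):
        value = form.get(field)
        if not isinstance(value, int) or value < 0:
            errors.append(f"{field} must be a non-negative integer")

    return errors


ok = validate_employee(
    {"city_development_index": 0.92, "relevant_experience": "Yes",
     "experience": 3, "training_hours": 40}
)
bad = validate_employee(
    {"city_development_index": 1.5, "relevant_experience": "Maybe",
     "experience": -1, "training_hours": 40}
)
print(ok)        # no problems for the first form
print(len(bad))  # three problems for the second form
```

Returning a list of messages rather than raising on the first failure lets the prediction page show every problem at once.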
#### 4. Dashboard
- Compare all trained models
- View training history
- Analyze feature importance across models
- Export reports
---
## Machine Learning Models
### 1. Logistic Regression
**When to use**: Baseline model, interpretable results
**Parameters**:
```python
LogisticRegression(
max_iter=1000,
random_state=42,
class_weight='balanced'
)
```
**Pros**:
- Fast training
- Highly interpretable
- Good for linearly separable data
**Cons**:
- Assumes linear relationship
- Less effective with complex patterns
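
The `class_weight='balanced'` setting above tells scikit-learn to weight each class by `n_samples / (n_classes * class_count)`, so errors on the rarer class cost more. A minimal pure-Python sketch of that formula follows. The helper name `balanced_class_weights` is hypothetical, and the label counts are illustrative values chosen to mirror the roughly 75% / 25% class split reported for this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative labels: 75 "stays" (0) and 25 "leaves" (1).
labels = [0] * 75 + [1] * 25
weights = balanced_class_weights(labels)

# The minority class gets a weight three times larger than the majority class.
print(weights)
```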
---
### 2. K-Nearest Neighbors (KNN)
**When to use**: Non-linear patterns, small-medium datasets
**Parameters**:
```python
KNeighborsClassifier(
n_neighbors=5,
weights='distance',
metric='euclidean'
)
```
**Pros**:
- Captures non-linear patterns
- Negligible training cost (fitting only stores the data)
- Effective for local patterns
**Cons**:
- Slow prediction time
- Sensitive to feature scaling
- Memory intensive
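
The sensitivity to feature scaling can be seen with two of this dataset's features: `training_hours` spans hundreds of units while `city_development_index` stays between 0 and 1, so unscaled Euclidean distance is driven almost entirely by training hours. The sketch below uses hypothetical rows and plain-Python z-scoring as a stand-in for `StandardScaler`; the helper names are illustrative.

```python
import math
import statistics

# Hypothetical rows: (city_development_index, training_hours).
rows = [(0.91, 80), (0.45, 52), (0.62, 10), (0.78, 200), (0.55, 120), (0.93, 30)]
query = (0.90, 50)


def standardize(points, reference):
    """Z-score each column using the mean and standard deviation of `reference`."""
    columns = list(zip(*reference))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]


def nearest(points, target):
    """Index of the point with the smallest Euclidean distance to `target`."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], target))


nearest_raw = rows[nearest(rows, query)]
scaled_rows = standardize(rows, rows)
scaled_query = standardize([query], rows)[0]
nearest_scaled = rows[nearest(scaled_rows, scaled_query)]

print(nearest_raw)     # picked almost entirely by training_hours
print(nearest_scaled)  # both features now contribute to the distance
```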
---
### 3. Support Vector Machine (SVM)
**When to use**: High-dimensional data, maximum margin classification
**Parameters**:
```python
SVC(
kernel='rbf',
C=1.0,
gamma='scale',
probability=True,
random_state=42
)
```
**Pros**:
- Effective in high dimensions
- Robust to outliers
- Strong theoretical foundation
**Cons**:
- Slower training
- Requires feature scaling
- Hard to interpret
---
### Data Balancing Techniques
#### Original Distribution
```
Staying: 75% (majority)
Leaving: 25% (minority)
```
#### Oversampling
```
Randomly duplicate minority class samples
Result: 50% vs 50% balanced distribution (minority count raised to match the majority)
```
#### Undersampling
```
Randomly remove majority class samples
Result: 50% vs 50% balanced distribution (majority count cut to match the minority)
```
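
Both balancing techniques can be sketched in a few lines of standard-library Python. The repository ships pre-balanced CSV files and this README does not show how they were generated, so the functions below (`random_oversample`, `random_undersample`) are hypothetical illustrations of the two ideas rather than the project's code. The toy data mirrors the 75% / 25% split described above.

```python
import random
from collections import Counter


def group_by_label(rows, label_of):
    """Collect rows into one list per class label."""
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    return groups


def random_oversample(rows, label_of, seed=42):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    groups = group_by_label(rows, label_of)
    target = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    return balanced


def random_undersample(rows, label_of, seed=42):
    """Keep a random subset of each class, sized to match the smallest one."""
    rng = random.Random(seed)
    groups = group_by_label(rows, label_of)
    target = min(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


def label(row):
    return row[1]


# Toy data: (row_id, target) with 75 rows of class 0 and 25 rows of class 1.
rows = [(i, 0) for i in range(75)] + [(i, 1) for i in range(75, 100)]

over = Counter(label(r) for r in random_oversample(rows, label))
under = Counter(label(r) for r in random_undersample(rows, label))
print(over)   # 75 rows in each class
print(under)  # 25 rows in each class
```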
---
## Dataset
### Features (11 input features plus the target)
| Feature | Type | Range/Values | Description |
|---------|------|--------------|-------------|
| city_development_index | float | 0.0 - 1.0 | City development level |
| gender | categorical | M/F | Employee gender |
| relevant_experience | binary | Yes/No | Has relevant experience |
| enrolled_university | categorical | Full-time/Part-time/No | University enrollment |
| education_level | categorical | HS/Bachelor/Master/PhD | Highest education |
| major_discipline | categorical | STEM/Business/Humanities | Field of study |
| experience | integer | 0-50 | Years of experience |
| company_size | categorical | Startup/MNC/Unicorn | Company size |
| company_type | categorical | IT/Service/Healthcare | Industry type |
| last_new_job | integer | 0-5 | Years between previous and current job |
| training_hours | integer | 0-500 | Professional training hours |
| **target** | **binary** | **0/1** | **0=Stays, 1=Leaves** |
### Dataset Size
- **Total Records**: 19,158 employees
- **Training Set**: 70% (13,410 records)
- **Test Set**: 30% (5,748 records)
- **Missing Values**: < 2% (handled)
- **Class Imbalance**: 75% vs 25%
### Data Preprocessing
```python
# Steps applied:
# 1. Load CSV data
# 2. Handle missing values (mean/mode imputation)
# 3. Encode categorical variables (LabelEncoder)
# 4. Scale numerical features (StandardScaler)
# 5. Split train/test (70/30, matching the Dataset Size figures)
# 6. Handle class imbalance (oversample/undersample)
```
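
The imputation, encoding, scaling, and splitting steps can be sketched on a tiny hypothetical dataset. To stay runnable without extra packages, this example uses only the standard library as a stand-in for `LabelEncoder` and `StandardScaler`; the records are invented, and the 70/30 ratio follows the Dataset Size list. It keeps the step order listed above, although in practice scaling statistics are usually computed on the training portion only, so that test data does not influence them.

```python
import random
import statistics

# Hypothetical records: (education_level, training_hours, target); None = missing.
raw = [
    ("Master", 40, 0), (None, 120, 1), ("Bachelor", None, 0), ("Bachelor", 20, 0),
    ("PhD", 60, 1), ("Bachelor", 80, 0), ("Master", 100, 0), ("Bachelor", 10, 1),
    ("Master", 30, 0), ("PhD", 80, 0),
]
education = [r[0] for r in raw]
hours = [r[1] for r in raw]
targets = [r[2] for r in raw]

# Impute: mode for the categorical column, mean for the numeric column.
edu_mode = statistics.mode([v for v in education if v is not None])
hours_mean = statistics.mean([v for v in hours if v is not None])
education = [edu_mode if v is None else v for v in education]
hours = [hours_mean if v is None else v for v in hours]

# Encode: map each category to an integer in sorted order.
classes = sorted(set(education))
edu_codes = [classes.index(v) for v in education]

# Scale: zero mean and unit variance for the numeric column.
mu, sigma = statistics.mean(hours), statistics.pstdev(hours)
hours_scaled = [(v - mu) / sigma for v in hours]

# Split: shuffle with a fixed seed, then cut 70/30.
rows = list(zip(edu_codes, hours_scaled, targets))
random.Random(42).shuffle(rows)
cut = len(rows) * 7 // 10
train, test = rows[:cut], rows[cut:]
print(classes, len(train), len(test))
```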
---
## Results & Performance
### Model Comparison (on test set)
| Metric | Logistic Regression | KNN (k=5) | SVM (RBF) |
|--------|-------------------|-----------|----------|
| **Accuracy** | 78.23% | 76.45% | 79.12% |
| **Precision** | 0.7891 | 0.7654 | 0.8023 |
| **Recall** | 0.7456 | 0.7234 | 0.7789 |
| **F1-Score** | 0.7667 | 0.7440 | 0.7904 |
| **AUC-ROC** | 0.8234 | 0.8012 | 0.8456 |
| **Training Time** | 2.3s | 0.5s | 45.2s |
### Best Performing Model: SVM
- Highest accuracy and F1-score
- Good balance between precision and recall
- Acceptable training time (45.2s), though the slowest of the three
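
For reference, accuracy, precision, recall, and F1 all derive from the four confusion-matrix counts. The sketch below shows the standard formulas for the positive class. The counts are hypothetical and are not taken from the project's results, and `classification_metrics` is an illustrative helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 130 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```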
---
## API Endpoints
### Flask Routes
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/` | GET | Home page |
| `/train` | GET, POST | Training interface |
| `/predict` | GET, POST | Prediction interface |
| `/results` | GET | View training results |
| `/dashboard` | GET | Analytics dashboard |
| `/api/train-model` | POST | Train model (JSON API) |
| `/api/predict` | POST | Make prediction (JSON API) |
| `/api/models` | GET | List trained models |
| `/api/export` | GET | Export results as CSV |
### API Examples
**Train Model**:
```bash
curl -X POST http://localhost:5000/api/train-model \
-H "Content-Type: application/json" \
-d '{
"algorithm": "svm",
"dataset": "oversample"
}'
```
**Make Prediction**:
```bash
curl -X POST http://localhost:5000/api/predict \
-H "Content-Type: application/json" \
-d '{
"city_development_index": 0.92,
"gender": "M",
"relevant_experience": "Yes",
"experience": 3,
"training_hours": 40
}'
```
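
The same prediction call can be made from Python with the standard library. The sketch below only builds the request so it runs without a server; the endpoint path and payload fields come from the curl example above, while `build_predict_request` is a hypothetical helper. The response schema is not documented in this README, so the commented-out lines simply print whatever JSON comes back.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # local development server from the Installation section


def build_predict_request(employee, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for /api/predict."""
    return urllib.request.Request(
        url=f"{base_url}/api/predict",
        data=json.dumps(employee).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


employee = {
    "city_development_index": 0.92,
    "gender": "M",
    "relevant_experience": "Yes",
    "experience": 3,
    "training_hours": 40,
}
request = build_predict_request(employee)
print(request.get_method(), request.full_url)

# With the Flask app running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```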
---
## Deployment
### Docker Setup
1. **Build image**:
```bash
docker build -t ml-predictor:latest .
```
2. **Run container**:
```bash
docker run -p 5000:5000 ml-predictor:latest
```
3. **Using Docker Compose**:
```bash
docker-compose up
```
### Production Deployment
**Using Gunicorn**:
```bash
gunicorn --workers 4 --bind 0.0.0.0:5000 main:app
```
**On Heroku**:
```bash
heroku login
heroku create ml-predictor
git push heroku main
```
---
## Testing
### Run Tests
```bash
python -m pytest tests/
```
### Test Coverage
- Unit tests for model training
- Integration tests for API endpoints
- Data preprocessing tests
- Prediction accuracy tests
---
## Troubleshooting
### Issue: "ModuleNotFoundError"
**Solution**: Install requirements
```bash
pip install -r requirements.txt
```
### Issue: "FileNotFoundError: data files"
**Solution**: Ensure the CSV files exist in the `data/` directory
### Issue: "Port 5000 already in use"
**Solution**: Use a different port (this assumes `main.py` accepts a `--port` argument)
```bash
python main.py --port 5001
```
---
## Future Enhancements
- [ ] Deep learning models (Neural Networks)
- [ ] Real-time data streaming
- [ ] Advanced feature engineering
- [ ] Model explainability (SHAP, LIME)
- [ ] A/B testing framework
- [ ] Automated retraining pipeline
- [ ] Mobile app integration
- [ ] Multi-language support
- [ ] Advanced visualization dashboards
- [ ] REST API v2
---
## Contributing
1. Fork repository
2. Create feature branch (`git checkout -b feature/improvement`)
3. Commit changes (`git commit -m 'Add improvement'`)
4. Push to branch (`git push origin feature/improvement`)
5. Open Pull Request
---
## License
This project is licensed under the **MIT License** - see [LICENSE](LICENSE) for details.
---
## Acknowledgments
- [Kaggle](https://www.kaggle.com/) - HR Analytics dataset
- [scikit-learn](https://scikit-learn.org/) - ML algorithms
- [Flask](https://flask.palletsprojects.com/) - Web framework
- [Pandas](https://pandas.pydata.org/) - Data processing
---
## Author
**Ahmed Hesham** - [@Ahmed122000](https://github.com/Ahmed122000)
**Built with ❤️ for HR Analytics & ML Deployment**