An open API service indexing awesome lists of open source software.

https://github.com/ahmed122000/ml_model_deployment

The HR Analytics: Job Change Predictor is a Flask-based web application that uses machine learning to predict whether an employee will stay with a company or leave. It allows users to train models, evaluate their performance, and make predictions based on employee data, providing valuable insights for HR decision-making.
https://github.com/ahmed122000/ml_model_deployment

classification flask machine-learning python3 rest-api scikit-learn

Last synced: about 2 months ago
JSON representation

The HR Analytics: Job Change Predictor is a Flask-based web application that uses machine learning to predict whether an employee will stay with a company or leave. It allows users to train models, evaluate their performance, and make predictions based on employee data, providing valuable insights for HR decision-making.

Awesome Lists containing this project

README

          

# ๐Ÿง  ML Model Deployment - HR Analytics Job Change Predictor

[![Python](https://img.shields.io/badge/Python-3.8%2B-blue?style=flat-square&logo=python)](https://www.python.org/)
[![Flask](https://img.shields.io/badge/Flask-2.0-black?style=flat-square&logo=flask)](https://flask.palletsprojects.com/)
[![scikit-learn](https://img.shields.io/badge/scikit--learn-1.0-orange?style=flat-square&logo=scikit-learn)](https://scikit-learn.org/)
[![Pandas](https://img.shields.io/badge/Pandas-1.3-green?style=flat-square&logo=pandas)](https://pandas.pydata.org/)
[![License](https://img.shields.io/badge/License-MIT-black?style=flat-square)](LICENSE)

> A production-ready Flask web application that predicts whether a data scientist will stay with a company or leave. Features machine learning model training, evaluation, and interactive predictions with data balancing techniques.

---

## ๐Ÿ“‘ Table of Contents

- [Overview](#-overview)
- [Features](#-features)
- [Tech Stack](#-tech-stack)
- [Project Structure](#-project-structure)
- [Installation](#-installation)
- [Usage](#-usage)
- [Machine Learning Models](#-machine-learning-models)
- [Dataset](#-dataset)
- [Results & Performance](#-results--performance)
- [API Endpoints](#-api-endpoints)
- [Deployment](#-deployment)
- [License](#-license)

---

## ๐Ÿ“Š Overview

This project builds a predictive model to determine whether data scientists will remain with their current employer or leave for better opportunities. The application provides:

- **Multiple ML algorithms** comparison
- **Data balancing** techniques (oversampling, undersampling)
- **Interactive training interface** for experimentation
- **Real-time predictions** on new employee data
- **Detailed evaluation metrics** and classification reports

**Business Value**: HR departments can identify at-risk employees and implement retention strategies.

---

## โœจ Features

### ๐Ÿค– Machine Learning Capabilities

| Feature | Description |
|---------|-------------|
| **Multiple Algorithms** | Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM) |
| **Data Balancing** | Handle imbalanced classes with oversampling, undersampling, SMOTE |
| **Cross-Validation** | K-Fold validation for robust model evaluation |
| **Hyperparameter Tuning** | GridSearchCV for optimal parameters |
| **Model Persistence** | Save/load trained models with joblib |
| **Feature Scaling** | StandardScaler for optimal algorithm performance |

### ๐Ÿ“ˆ Analysis & Reporting

| Feature | Description |
|---------|-------------|
| **Classification Metrics** | Precision, Recall, F1-Score, Accuracy, AUC-ROC |
| **Confusion Matrix** | Visual confusion matrix visualization |
| **Train/Test Scores** | Detailed performance on training and test sets |
| **Classification Report** | Per-class precision, recall, F1-score |
| **Feature Importance** | Identify most influential features |
| **ROC Curves** | Receiver Operating Characteristic analysis |

### ๐ŸŽฏ Prediction Features

| Feature | Description |
|---------|-------------|
| **Batch Predictions** | Predict on multiple employees at once |
| **Confidence Scores** | Probability of staying vs leaving |
| **Feature-wise Explanation** | Understand prediction reasoning |
| **Historical Comparisons** | Track prediction accuracy over time |

### ๐Ÿ–ฅ๏ธ User Interface

| Feature | Description |
|---------|-------------|
| **Interactive Dashboard** | Real-time model performance visualization |
| **Model Comparison** | Compare different algorithms side-by-side |
| **Training History** | Track all trained models and their metrics |
| **Download Reports** | Export predictions and analysis as CSV/PDF |

---

## ๐Ÿ› ๏ธ Tech Stack

| Component | Technology |
|-----------|-----------|
| **Backend** | Python 3.8+, Flask 2.0 |
| **ML Libraries** | scikit-learn, XGBoost, LightGBM |
| **Data Processing** | Pandas, NumPy |
| **Visualization** | Matplotlib, Seaborn, Plotly |
| **Model Storage** | joblib |
| **Frontend** | HTML5, CSS3, JavaScript, Bootstrap |
| **Deployment** | Gunicorn, Docker |

---

## ๐Ÿ“‚ Project Structure

```plaintext
ml-model-deployment/
โ”œโ”€โ”€ main.py # Flask application entry point
โ”œโ”€โ”€ train.py # Model training logic
โ”œโ”€โ”€ predict.py # Prediction logic
โ”œโ”€โ”€ evaluate.py # Model evaluation
โ”œโ”€โ”€ data_processor.py # Data loading & preprocessing
โ”œโ”€โ”€ config.py # Configuration settings
โ”‚
โ”œโ”€โ”€ requirements.txt # Python dependencies
โ”œโ”€โ”€ Dockerfile # Container configuration
โ”œโ”€โ”€ docker-compose.yml # Multi-container setup
โ”‚
โ”œโ”€โ”€ data/ # Training datasets
โ”‚ โ”œโ”€โ”€ normal_data.csv # Original balanced data
โ”‚ โ”œโ”€โ”€ oversample.csv # Oversampled data
โ”‚ โ””โ”€โ”€ undersample_data.csv # Undersampled data
โ”‚
โ”œโ”€โ”€ models/ # Saved trained models
โ”‚ โ”œโ”€โ”€ lr_model.pkl # Logistic Regression
โ”‚ โ”œโ”€โ”€ knn_model.pkl # KNN model
โ”‚ โ”œโ”€โ”€ svm_model.pkl # SVM model
โ”‚ โ””โ”€โ”€ scalers/ # Feature scalers
โ”‚
โ”œโ”€โ”€ templates/ # HTML templates
โ”‚ โ”œโ”€โ”€ base.html # Base template
โ”‚ โ”œโ”€โ”€ index.html # Home page
โ”‚ โ”œโ”€โ”€ train.html # Training interface
โ”‚ โ”œโ”€โ”€ predict.html # Prediction interface
โ”‚ โ”œโ”€โ”€ results.html # Results display
โ”‚ โ””โ”€โ”€ dashboard.html # Analytics dashboard
โ”‚
โ”œโ”€โ”€ static/ # Static files
โ”‚ โ”œโ”€โ”€ css/
โ”‚ โ”‚ โ”œโ”€โ”€ style.css # Custom styling
โ”‚ โ”‚ โ””โ”€โ”€ bootstrap.min.css
โ”‚ โ”œโ”€โ”€ js/
โ”‚ โ”‚ โ”œโ”€โ”€ script.js # Client-side logic
โ”‚ โ”‚ โ””โ”€โ”€ charts.js # Chart generation
โ”‚ โ””โ”€โ”€ images/ # UI images
โ”‚
โ””โ”€โ”€ README.md # This file
```

---

## ๐Ÿš€ Installation

### Prerequisites

- Python 3.8 or higher
- pip (Python package manager)
- Virtual environment (recommended)
- 2GB RAM minimum

### Step-by-Step Setup

1. **Clone repository**:
```bash
git clone https://github.com/Ahmed122000/ML_model_deployment.git
cd ML_model_deployment
```

2. **Create virtual environment**:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

3. **Install dependencies**:
```bash
pip install -r requirements.txt
```

4. **Prepare datasets**:
```bash
# Ensure these files exist in data/ directory:
# - normal_data.csv
# - oversample.csv
# - undersample_data.csv
```

5. **Run application**:
```bash
python main.py
```

6. **Access application**:
```
http://localhost:5000
```

---

## ๐Ÿ’ก Usage

### Web Interface Navigation

#### 1๏ธโƒฃ Home Page
- Project overview
- Quick links to train/predict
- Model statistics

#### 2๏ธโƒฃ Training Models

**Steps**:
1. Navigate to "Train Models" tab
2. **Select Dataset**:
- Normal (original data)
- Oversampled (more minority class samples)
- Undersampled (fewer majority class samples)
3. **Choose Algorithm**:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
4. **Optional**: Adjust hyperparameters
5. Click "Train Model"
6. View results:
- Train/Test scores
- Classification report
- Confusion matrix
- Feature importance

**Training Output**:
```
Model: Logistic Regression
Dataset: Oversampled
Train Score: 0.8245
Test Score: 0.7893
Precision: 0.8102
Recall: 0.7654
F1-Score: 0.7873
```

#### 3๏ธโƒฃ Making Predictions

**Steps**:
1. Navigate to "Predict" tab
2. Fill employee information:
- City Development Index (0.0 - 1.0)
- Gender (M/F)
- Relevant Experience (Yes/No)
- Enrolled in University (Yes/No)
- Education Level (High School/Bachelor/Master/PhD)
- Major Discipline
- Experience (years)
- Company Size
- Company Type
- Last New Job (years)
- Training Hours

3. Click "Predict"
4. View prediction result:
- **Will Stay** or **Will Leave**
- Confidence percentage
- Feature contributions

#### 4๏ธโƒฃ Dashboard

- Compare all trained models
- View training history
- Analyze feature importance across models
- Export reports

---

## ๐Ÿค– Machine Learning Models

### 1. Logistic Regression

**When to use**: Baseline model, interpretable results

**Parameters**:
```python
LogisticRegression(
max_iter=1000,
random_state=42,
class_weight='balanced'
)
```

**Pros**:
- Fast training
- Highly interpretable
- Good for linearly separable data

**Cons**:
- Assumes linear relationship
- Less effective with complex patterns

---

### 2. K-Nearest Neighbors (KNN)

**When to use**: Non-linear patterns, small-medium datasets

**Parameters**:
```python
KNeighborsClassifier(
n_neighbors=5,
weights='distance',
metric='euclidean'
)
```

**Pros**:
- Captures non-linear patterns
- No training phase
- Effective for local patterns

**Cons**:
- Slow prediction time
- Sensitive to feature scaling
- Memory intensive

---

### 3. Support Vector Machine (SVM)

**When to use**: High-dimensional data, maximum margin classification

**Parameters**:
```python
SVC(
kernel='rbf',
C=1.0,
gamma='scale',
probability=True,
random_state=42
)
```

**Pros**:
- Effective in high dimensions
- Robust to outliers
- Strong theoretical foundation

**Cons**:
- Slower training
- Requires feature scaling
- Hard to interpret

---

### Data Balancing Techniques

#### Original Distribution
```
Staying: 75% (majority)
Leaving: 25% (minority)
```

#### Oversampling
```
Randomly duplicate minority class samples
Result: 75% vs 75% balanced distribution
```

#### Undersampling
```
Randomly remove majority class samples
Result: 25% vs 25% balanced distribution
```

---

## ๐Ÿ“Š Dataset

### Features (12 input features)

| Feature | Type | Range/Values | Description |
|---------|------|--------------|-------------|
| city_development_index | float | 0.0 - 1.0 | City development level |
| gender | categorical | M/F | Employee gender |
| relevant_experience | binary | Yes/No | Has relevant experience |
| enrolled_university | categorical | Full-time/Part-time/No | University enrollment |
| education_level | categorical | HS/Bachelor/Master/PhD | Highest education |
| major_discipline | categorical | STEM/Business/Humanities | Field of study |
| experience | integer | 0-50 | Years of experience |
| company_size | categorical | Startup/MNC/Unicorn | Company size |
| company_type | categorical | IT/Service/Healthcare | Industry type |
| last_new_job | integer | 0-5 | Years at current job |
| training_hours | integer | 0-500 | Professional training hours |
| **target** | **binary** | **0/1** | **0=Stays, 1=Leaves** |

### Dataset Size

- **Total Records**: 19,158 employees
- **Training Set**: 70% (13,410 records)
- **Test Set**: 30% (5,748 records)
- **Missing Values**: < 2% (handled)
- **Class Imbalance**: 75% vs 25%

### Data Preprocessing

```python
# Steps applied:
1. Load CSV data
2. Handle missing values (mean/mode imputation)
3. Encode categorical variables (LabelEncoder)
4. Scale numerical features (StandardScaler)
5. Split train/test (80/20)
6. Handle class imbalance (oversample/undersample)
```

---

## ๐Ÿ“ˆ Results & Performance

### Model Comparison (on test set)

| Metric | Logistic Regression | KNN (k=5) | SVM (RBF) |
|--------|-------------------|-----------|----------|
| **Accuracy** | 78.23% | 76.45% | 79.12% |
| **Precision** | 0.7891 | 0.7654 | 0.8023 |
| **Recall** | 0.7456 | 0.7234 | 0.7789 |
| **F1-Score** | 0.7667 | 0.7440 | 0.7904 |
| **AUC-ROC** | 0.8234 | 0.8012 | 0.8456 |
| **Training Time** | 2.3s | 0.5s | 45.2s |

### Best Performing Model: SVM
- Highest accuracy and F1-score
- Good balance between precision and recall
- Acceptable training time

---

## ๐Ÿ”Œ API Endpoints

### Flask Routes

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/` | GET | Home page |
| `/train` | GET, POST | Training interface |
| `/predict` | GET, POST | Prediction interface |
| `/results` | GET | View training results |
| `/dashboard` | GET | Analytics dashboard |
| `/api/train-model` | POST | Train model (JSON API) |
| `/api/predict` | POST | Make prediction (JSON API) |
| `/api/models` | GET | List trained models |
| `/api/export` | GET | Export results as CSV |

### API Examples

**Train Model**:
```bash
curl -X POST http://localhost:5000/api/train-model \
-H "Content-Type: application/json" \
-d '{
"algorithm": "svm",
"dataset": "oversample"
}'
```

**Make Prediction**:
```bash
curl -X POST http://localhost:5000/api/predict \
-H "Content-Type: application/json" \
-d '{
"city_development_index": 0.92,
"gender": "M",
"relevant_experience": "Yes",
"experience": 3,
"training_hours": 40
}'
```

---

## ๐Ÿณ Deployment

### Docker Setup

1. **Build image**:
```bash
docker build -t ml-predictor:latest .
```

2. **Run container**:
```bash
docker run -p 5000:5000 ml-predictor:latest
```

3. **Using Docker Compose**:
```bash
docker-compose up
```

### Production Deployment

**Using Gunicorn**:
```bash
gunicorn --workers 4 --bind 0.0.0.0:5000 main:app
```

**On Heroku**:
```bash
heroku login
heroku create ml-predictor
git push heroku main
```

---

## ๐Ÿงช Testing

### Run Tests
```bash
python -m pytest tests/
```

### Test Coverage
- Unit tests for model training
- Integration tests for API endpoints
- Data preprocessing tests
- Prediction accuracy tests

---

## ๐Ÿ› Troubleshooting

### Issue: "ModuleNotFoundError"
**Solution**: Install requirements
```bash
pip install -r requirements.txt
```

### Issue: "FileNotFoundError: data files"
**Solution**: Ensure CSV files exist in `data/` directory

### Issue: "Port 5000 already in use"
**Solution**: Use different port
```bash
python main.py --port 5001
```

---

## ๐Ÿ“ˆ Future Enhancements

- [ ] Deep learning models (Neural Networks)
- [ ] Real-time data streaming
- [ ] Advanced feature engineering
- [ ] Model explainability (SHAP, LIME)
- [ ] A/B testing framework
- [ ] Automated retraining pipeline
- [ ] Mobile app integration
- [ ] Multi-language support
- [ ] Advanced visualization dashboards
- [ ] REST API v2

---

## ๐Ÿ“ Contributing

1. Fork repository
2. Create feature branch (`git checkout -b feature/improvement`)
3. Commit changes (`git commit -m 'Add improvement'`)
4. Push to branch (`git push origin feature/improvement`)
5. Open Pull Request

---

## ๐Ÿ“„ License

This project is licensed under the **MIT License** - see [LICENSE](LICENSE) for details.

---

## ๐Ÿ™ Acknowledgments

- [Kaggle](https://www.kaggle.com/) - HR Analytics dataset
- [scikit-learn](https://scikit-learn.org/) - ML algorithms
- [Flask](https://flask.palletsprojects.com/) - Web framework
- [Pandas](https://pandas.pydata.org/) - Data processing

---

## ๐Ÿ‘จโ€๐Ÿ’ป Author

**Ahmed Hesham** - [@Ahmed122000](https://github.com/Ahmed122000)

**Built with โค๏ธ for HR Analytics & ML Deployment**