https://github.com/jonasneves/aipi510-project3
Duke AIPI 510 Project 3 • AI/ML Salary Predictor • XGBoost model trained on H1B, Linkedin, Adzuna data • FastAPI + React
aws-s3 fastapi machine-learning mlflow python react salary-prediction xgboost
- Host: GitHub
- URL: https://github.com/jonasneves/aipi510-project3
- Owner: jonasneves
- License: mit
- Created: 2025-11-22T15:07:50.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-12-21T02:16:03.000Z (6 months ago)
- Last Synced: 2025-12-22T23:19:27.932Z (6 months ago)
- Topics: aws-s3, fastapi, machine-learning, mlflow, python, react, salary-prediction, xgboost
- Language: Jupyter Notebook
- Homepage: https://aisalary.neevs.io/
- Size: 9.08 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# AI Salary Prediction Pipeline
[API](https://aisalary.neevs.io/api) | [Web app](https://aisalary.neevs.io) | [ML pipeline workflow](https://github.com/jonasneves/aipi510-project3/actions/workflows/ml-pipeline.yml)
**Live Demo:** [aisalary.neevs.io](https://aisalary.neevs.io) | [API Docs](https://aisalary.neevs.io/api/docs) | [Reports Portal (EDA + MLflow)](https://jonasneves.github.io/aipi510-project3/)
## Overview
Predict AI/ML salaries using machine learning. Built for Duke AIPI 510 Module Project 3.
**Problem:** Estimate salary ranges for AI/ML roles based on job title, location, experience, and skills.
**Solution:** XGBoost regression model trained on H1B visa filings, LinkedIn job postings, and Adzuna market data, deployed as a FastAPI service with a React frontend.
## Dataset
| Source | Description | Priority | Records |
|--------|-------------|----------|---------|
| [H1B Visa Data](https://www.dol.gov/agencies/eta/foreign-labor/performance) | DOL certified visa applications with actual salaries | 1 | ~10,000 AI/ML jobs |
| [LinkedIn](https://www.linkedin.com) | Job postings with detailed salary, seniority, skills data | 1 | ~1,000+ (growing) |
| [Adzuna](https://developer.adzuna.com/) | Job postings with salary ranges | 2 | ~16,500 |
The data is hosted on AWS S3. The pipeline downloads and merges the sources automatically.
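The README does not say how the Priority column is applied during the merge. One plausible reading is that when the same posting appears in more than one source, the record from the source with the lower priority number wins. The sketch below illustrates that idea only; the field names (`source`, `job_title`, `company`, `state`, `salary`), the deduplication key, and the sample rows are all hypothetical and are not taken from the project's code or data.

```python
from typing import Iterable

# Priority per source, copied from the table above (lower number = preferred).
SOURCE_PRIORITY = {"h1b": 1, "linkedin": 1, "adzuna": 2}


def merge_sources(records: Iterable[dict]) -> list:
    """Keep one record per (title, company, state), preferring lower priority numbers."""
    best = {}
    for rec in records:
        # Hypothetical dedup key; the real pipeline may match on other fields.
        key = (rec["job_title"].lower(), rec["company"].lower(), rec["state"])
        current = best.get(key)
        if current is None or SOURCE_PRIORITY[rec["source"]] < SOURCE_PRIORITY[current["source"]]:
            best[key] = rec
    return list(best.values())


# Made-up rows for illustration only.
rows = [
    {"source": "adzuna", "job_title": "ML Engineer", "company": "Acme", "state": "CA", "salary": 150000},
    {"source": "h1b", "job_title": "ML Engineer", "company": "Acme", "state": "CA", "salary": 165000},
    {"source": "linkedin", "job_title": "Data Scientist", "company": "Beta", "state": "NY", "salary": 140000},
]
merged = merge_sources(rows)
```

Here the Adzuna row is replaced by the H1B row for the same posting, while the LinkedIn row is kept because nothing else matches it.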
## Model
**Architecture:** XGBoost Regressor
| Parameter | Value |
|-----------|-------|
| n_estimators | 200 |
| max_depth | 6 |
| learning_rate | 0.1 |
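
As a rough illustration of how the hyperparameters in this table map onto the `xgboost` scikit-learn wrapper, the following sketch builds an untrained regressor. It is not the project's training code; any settings beyond the three listed values are left at the library defaults, which may differ from what the pipeline actually uses.

```python
# Hyperparameters copied from the table above.
XGB_PARAMS = {"n_estimators": 200, "max_depth": 6, "learning_rate": 0.1}


def build_model():
    """Return an untrained XGBRegressor configured with XGB_PARAMS."""
    # Imported lazily so the constants can be inspected without xgboost installed.
    from xgboost import XGBRegressor

    return XGBRegressor(**XGB_PARAMS)
```

A caller would then fit it with something like `build_model().fit(X_train, y_train)`, where `X_train` and `y_train` are placeholders for the encoded features and salary targets.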
**Evaluation Metrics:**
- MAE: ~$36,000
- RMSE: ~$52,000
- MAPE: ~23%
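
For readers unfamiliar with these three metrics, here is a minimal sketch of how each is computed, using the standard definitions. The salaries in the example are made up for illustration and are not project data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE (both in target units) and MAPE (in percent)."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE divides each error by the true value, so it assumes no true value is zero.
    mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "rmse": rmse, "mape": mape}


# Made-up salaries for illustration only.
actual = [100_000, 150_000, 200_000]
predicted = [110_000, 140_000, 230_000]
metrics = regression_metrics(actual, predicted)
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, so the gap between the reported ~$52,000 and ~$36,000 suggests that a minority of predictions miss by a wide margin.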
**Top Features:** Years of experience, company tier, role type (researcher/scientist/analyst), entry-level indicator
## Experiment Tracking
MLflow is used for experiment tracking during model training.
→ **[MLflow Overview](docs/MLflow-Overview.pdf)** | **[Feature Importance](docs/MLflow-FeatureImportance.pdf)**
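
A minimal sketch of what logging one training run to MLflow can look like is shown below. It uses the hyperparameters and approximate metrics reported in the Model section; the run name is a hypothetical label, and the project's actual tracking code may log different or additional items.

```python
# Values copied from the Model section; the metrics there are approximate.
PARAMS = {"n_estimators": 200, "max_depth": 6, "learning_rate": 0.1}
METRICS = {"mae": 36000.0, "rmse": 52000.0, "mape": 23.0}


def log_training_run(params, metrics, run_name="xgboost-baseline"):
    """Record one training run in MLflow (run_name is a hypothetical label)."""
    import mlflow  # imported lazily; requires the mlflow package

    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(params)    # hyperparameters, logged as key/value pairs
        mlflow.log_metrics(metrics)  # numeric evaluation results
```

Calling `log_training_run(PARAMS, METRICS)` writes to MLflow's default tracking location unless a tracking URI has been configured.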
## Architecture

## Quick Start
```bash
make install # Install Python dependencies
make frontend-install # Install frontend dependencies
make pipeline # Collect data, merge, and train model
make api # Start FastAPI server (port 8000)
make frontend # Start React dev server (port 5173)
```
To train just the model: `make train`
To test the API locally: `curl http://localhost:8000/api/health`
## Tech Stack
| Layer | Technology |
|-------|------------|
| ML | XGBoost, scikit-learn, pandas |
| API | FastAPI, Pydantic |
| Frontend | React, Vite, Tailwind CSS |
| Tracking | MLflow |
| Cloud Storage | AWS S3 (data hosting) |
| Cloud Deployment | Cloudflare Tunnel (API + frontend) |
| CI/CD | GitHub Actions |
## Project Structure
```
src/ # ML pipeline (collectors, processing, models)
api/ # FastAPI endpoints
frontend-react/ # React frontend
configs/ # YAML configuration files
config.yaml # Main pipeline configuration
Makefile # Build commands
Dockerfile # Container build
```
## API
```bash
# Predict salary
curl -X POST https://aisalary.neevs.io/api/predict \
  -H "Content-Type: application/json" \
  -d '{"job_title": "ML Engineer", "location": "CA", "experience_years": 5}'

# Get options
curl https://aisalary.neevs.io/api/options
```
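
The same prediction request can be made from Python using only the standard library. This is a sketch that mirrors the `curl` call above; the helper names are hypothetical, and because the response schema is not documented here, the parsed JSON is returned as-is.

```python
import json
import urllib.request

BASE_URL = "https://aisalary.neevs.io/api"


def build_predict_request(job_title, location, experience_years, base_url=BASE_URL):
    """Build the POST request for /predict without sending it."""
    payload = {
        "job_title": job_title,
        "location": location,
        "experience_years": experience_years,
    }
    return urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(job_title, location, experience_years):
    """Send the request and return the decoded JSON body (schema not documented here)."""
    request = build_predict_request(job_title, location, experience_years)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

For local testing, pass `base_url="http://localhost:8000/api"` to `build_predict_request`, matching the port used by `make api`.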
## Documentation
- [Setup Guide](docs/SETUP.md) - Local development
- [Deployment Guide](docs/DEPLOYMENT.md) - Cloud deployment & AWS setup
## Limitations & Ethical Considerations
- **Geographic bias:** H1B data skews toward CA, NY, WA where most visa sponsors operate
- **Role coverage:** Limited to AI/ML titles; doesn't cover adjacent roles well
- **Temporal lag:** H1B filings reflect offers made 6-12 months prior
- **Company representation:** Large tech companies overrepresented vs. startups
- **Responsible use:** Use predictions as one data point among many; avoid anchoring salary negotiations solely on model outputs
## AI Usage Acknowledgement
**AI Assistants:**
- Claude Code (Anthropic) - code development, documentation, and research
- Gemini 3 Pro Image / Nano Banana Pro (Google) - visual design
All code and analysis were reviewed, tested, and thoroughly understood by the team. The team takes full responsibility for the implementation and can explain all design decisions.
## Authors
Jonas De Oliveira Neves & Omkar Sreekanth
Duke University - AIPI 510, 2025
## License
MIT