https://github.com/jonasneves/aipi510-project3

Duke AIPI 510 Project 3 • AI/ML Salary Predictor • XGBoost model trained on H1B, LinkedIn, Adzuna data • FastAPI + React
aws-s3 fastapi machine-learning mlflow python react salary-prediction xgboost


# AI Salary Prediction Pipeline

[![API Status](https://img.shields.io/endpoint?url=https://aisalary.neevs.io/api/badge/api&label=API)](https://aisalary.neevs.io/api)
[![App Status](https://img.shields.io/endpoint?url=https://aisalary.neevs.io/api/badge/app&label=App)](https://aisalary.neevs.io)

[![ML Pipeline](https://github.com/jonasneves/aipi510-project3/actions/workflows/ml-pipeline.yml/badge.svg)](https://github.com/jonasneves/aipi510-project3/actions/workflows/ml-pipeline.yml)

**Live Demo:** [aisalary.neevs.io](https://aisalary.neevs.io) | [API Docs](https://aisalary.neevs.io/api/docs) | [Reports Portal (EDA + MLflow)](https://jonasneves.github.io/aipi510-project3/)

## Overview

Predict AI/ML salaries using machine learning. Built for Duke AIPI 510 Module Project 3.

**Problem:** Estimate salary ranges for AI/ML roles based on job title, location, experience, and skills.

**Solution:** XGBoost regression model trained on H1B visa filings, LinkedIn job postings, and Adzuna market data, deployed as a FastAPI service with a React frontend.

## Dataset

| Source | Description | Priority | Records |
|--------|-------------|----------|---------|
| [H1B Visa Data](https://www.dol.gov/agencies/eta/foreign-labor/performance) | DOL certified visa applications with actual salaries | 1 | ~10,000 AI/ML jobs |
| [LinkedIn](https://www.linkedin.com) | Job postings with detailed salary, seniority, skills data | 1 | ~1,000+ (growing) |
| [Adzuna](https://developer.adzuna.com/) | Job postings with salary ranges | 2 | ~16,500 |

Data hosted on AWS S3. Pipeline downloads and merges sources automatically.
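
The merge step is not documented here, so the following is only a minimal sketch of one way a priority-aware merge could work. It assumes that a lower priority number means a preferred source and that duplicates are identified by a hypothetical `(title, company, state)` key; the field names, the `SOURCE_PRIORITY` mapping, and `merge_sources` are illustrative and not taken from the project code.

```python
from typing import Iterable

# Priorities mirror the table above. Treating 1 as "preferred over 2"
# is an assumption, not something this README states.
SOURCE_PRIORITY = {"h1b": 1, "linkedin": 1, "adzuna": 2}


def merge_sources(records: Iterable[dict]) -> list:
    """Keep one record per (title, company, state) key.

    When two records share a key, the one from the source with the
    lower priority number wins; ties keep the record seen first.
    """
    best = {}
    for rec in records:
        # Hypothetical duplicate key; the real pipeline may use other fields.
        key = (rec["title"].lower(), rec["company"].lower(), rec["state"])
        current = best.get(key)
        if current is None:
            best[key] = rec
        elif SOURCE_PRIORITY[rec["source"]] < SOURCE_PRIORITY[current["source"]]:
            best[key] = rec
    return list(best.values())
```

With this sketch, an Adzuna posting that duplicates an H1B filing would be dropped in favor of the H1B record, since that record carries a salary from a certified application rather than an advertised range.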

## Model

**Architecture:** XGBoost Regressor

| Parameter | Value |
|-----------|-------|
| n_estimators | 200 |
| max_depth | 6 |
| learning_rate | 0.1 |
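
As a rough illustration, the hyperparameters in the table map directly onto the `XGBRegressor` constructor from the `xgboost` package. The `XGB_PARAMS` and `build_model` names are hypothetical, and any settings not listed in the table are assumed to be left at the library defaults.

```python
# Hyperparameters copied from the table above.
XGB_PARAMS = {"n_estimators": 200, "max_depth": 6, "learning_rate": 0.1}


def build_model():
    """Return an untrained XGBoost regressor configured with XGB_PARAMS."""
    # Imported lazily so the parameter dictionary can be inspected
    # even where xgboost is not installed.
    from xgboost import XGBRegressor

    return XGBRegressor(**XGB_PARAMS)
```

Training would then follow the usual scikit-learn style, for example `build_model().fit(X_train, y_train)`, where `X_train` and `y_train` stand in for the project's feature matrix and salary targets.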

**Evaluation Metrics:**
- MAE: ~$36,000
- RMSE: ~$52,000
- MAPE: ~23%
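
For readers unfamiliar with these metrics, the sketch below shows how MAE, RMSE, and MAPE are conventionally computed from true and predicted salaries. It uses the standard definitions in plain Python; the `regression_metrics` helper is illustrative and not necessarily how the project computes its numbers.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE (same units as the target) and MAPE (percent)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    n = len(errors)
    # Mean absolute error: average size of the miss, in dollars.
    mae = sum(abs(e) for e in errors) / n
    # Root mean squared error: penalizes large misses more heavily.
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Mean absolute percentage error: miss relative to the true salary.
    # Assumes no true value is zero, which holds for salaries.
    mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "rmse": rmse, "mape": mape}
```

Because RMSE squares each error before averaging, an RMSE noticeably above the MAE, as reported here, suggests that a subset of predictions miss by a wide margin.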

**Top Features:** Years of experience, company tier, role type (researcher/scientist/analyst), entry-level indicator

## Experiment Tracking

MLflow is used for experiment tracking during model training.

→ **[MLflow Overview](docs/MLflow-Overview.pdf)** | **[Feature Importance](docs/MLflow-FeatureImportance.pdf)**

## Architecture

![Architecture Diagram](architecture.png)

## Quick Start

```bash
make install # Install Python dependencies
make frontend-install # Install frontend dependencies
make pipeline # Collect data, merge, and train model
make api # Start FastAPI server (port 8000)
make frontend # Start React dev server (port 5173)
```

- To train only the model: `make train`
- To test the API locally: `curl http://localhost:8000/api/health`

## Tech Stack

| Layer | Technology |
|-------|------------|
| ML | XGBoost, scikit-learn, pandas |
| API | FastAPI, Pydantic |
| Frontend | React, Vite, Tailwind CSS |
| Tracking | MLflow |
| Cloud Storage | AWS S3 (data hosting) |
| Cloud Deployment | Cloudflare Tunnel (API + frontend) |
| CI/CD | GitHub Actions |

## Project Structure

```
src/ # ML pipeline (collectors, processing, models)
api/ # FastAPI endpoints
frontend-react/ # React frontend
configs/ # YAML configuration files
config.yaml # Main pipeline configuration
Makefile # Build commands
Dockerfile # Container build
```

## API

```bash
# Predict salary
curl -X POST https://aisalary.neevs.io/api/predict \
  -H "Content-Type: application/json" \
  -d '{"job_title": "ML Engineer", "location": "CA", "experience_years": 5}'

# Get options
curl https://aisalary.neevs.io/api/options
```
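
The same prediction request can be issued from Python. The sketch below uses only the standard library and mirrors the `curl` call above; `build_predict_request` and `predict_salary` are hypothetical helper names, and because the response schema is not described here, the parsed JSON is returned as-is.

```python
import json
import urllib.request

API_BASE = "https://aisalary.neevs.io/api"


def build_predict_request(job_title, location, experience_years):
    """Build the POST request for /predict without sending it."""
    payload = {
        "job_title": job_title,
        "location": location,
        "experience_years": experience_years,
    }
    return urllib.request.Request(
        f"{API_BASE}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_salary(job_title, location, experience_years, timeout=10.0):
    """Send the request and return the decoded JSON body.

    The shape of the response is not documented in this README, so the
    caller receives whatever JSON the service returns.
    """
    request = build_predict_request(job_title, location, experience_years)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `predict_salary("ML Engineer", "CA", 5)` sends the same payload as the `curl` example. Keeping request construction separate from sending makes the payload easy to inspect or test without network access.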

## Documentation

- [Setup Guide](docs/SETUP.md) - Local development
- [Deployment Guide](docs/DEPLOYMENT.md) - Cloud deployment & AWS setup

## Limitations & Ethical Considerations

- **Geographic bias:** H1B data skews toward CA, NY, WA where most visa sponsors operate
- **Role coverage:** Limited to AI/ML titles; doesn't cover adjacent roles well
- **Temporal lag:** H1B filings reflect offers made 6-12 months prior
- **Company representation:** Large tech companies overrepresented vs. startups
- **Responsible use:** Use predictions as one data point among many; avoid anchoring salary negotiations solely on model outputs

## AI Usage Acknowledgement

**AI Assistants:**
- Claude Code (Anthropic) - code development, documentation, and research
- Gemini 3 Pro Image / Nano Banana Pro (Google) - visual design

All code and analysis were reviewed, tested, and thoroughly understood by the team. The team takes full responsibility for the implementation and can explain all design decisions.

## Authors

Jonas De Oliveira Neves & Omkar Sreekanth

Duke University - AIPI 510, 2025

## License

MIT