https://github.com/friendotjava/income-prediction
A machine learning project that classifies whether a person's income is >$50K or ≤$50K using several models. Integrated with a DVC pipeline.
- Host: GitHub
- URL: https://github.com/friendotjava/income-prediction
- Owner: FrienDotJava
- Created: 2025-09-28T03:55:31.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-09-28T06:58:31.000Z (9 months ago)
- Last Synced: 2025-09-28T07:22:52.865Z (9 months ago)
- Language: Python
- Size: 11.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Income Prediction (Adult Census)
> A modular machine-learning project that predicts whether a person's annual income is **> $50K** or **≤ $50K**, built with a clean Cookiecutter Data Science structure, tracked with **DVC** (and DVCLive), and instrumented for **MLflow** experiment logging. A small **Streamlit** dashboard is included for quick exploration.
---
Try it here: https://income-prediction1.streamlit.app/
## Project goals
- Train solid baseline & boosted tree models for the Adult/Census Income task (binary classification >$50K).
- Keep work **reproducible** (DVC pipelines + parameters), **trackable** (DVCLive/MLflow), and **organized** (Cookiecutter DS layout).
- Provide a minimal **dashboard** to poke the model and visualize results.
---
## Repository structure
```
├── data/                   # raw/ → interim/ → processed/ (DVC-managed)
├── docs/                   # (optional) project docs
├── dvclive/                # live metrics/artifacts from runs
├── income_classification/
│   ├── __init__.py
│   ├── config.py
│   ├── dataset.py          # data download/prepare helpers
│   ├── features.py         # feature engineering
│   └── modeling/
│       ├── __init__.py
│       ├── predict.py      # inference script
│       └── train.py        # training script
├── notebooks/              # EDA & scratch work
├── references/             # data dictionary, notes, etc.
├── dashboard.py            # Streamlit mini app
├── dvc.yaml                # DVC pipeline (stages & deps)
├── dvc.lock                # DVC lockfile (auto-generated)
├── params.yaml             # central hyperparams & config
├── requirements.txt        # Python dependencies
├── Makefile                # convenience commands
└── README.md
```
---
## Dataset
This project uses the **Adult (Census Income)** dataset: **48,842** rows, **14** features, binary target (> $50K).
You can obtain it from UCI or Kaggle.
> Place raw files under `data/raw/`.
---
## Quickstart
### 1) Setup environment
```bash
git clone https://github.com/FrienDotJava/income-prediction.git
cd income-prediction
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
### 2) Get the data
Download the Adult/Census Income data and put it here:
```
data/
└── raw/
    └── adult.csv
```
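Before running the pipeline, it can help to confirm that the file is where the pipeline expects it and that the target column looks sane. The sketch below uses only the standard library. The column name `income` and the helper name `label_counts` are assumptions for illustration, not names taken from this repository, so adjust them to match the header of your copy of the CSV.

```python
import csv
from collections import Counter
from pathlib import Path


def label_counts(csv_path, label_column="income"):
    """Count how often each target label appears in the raw CSV.

    `label_column` is an assumed column name; check your file's header.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected raw data at {path}")

    counts = Counter()
    with path.open(newline="", encoding="utf-8") as handle:
        # skipinitialspace drops the blank that often follows each comma
        reader = csv.DictReader(handle, skipinitialspace=True)
        if label_column not in (reader.fieldnames or []):
            raise KeyError(f"Column {label_column!r} not found in {reader.fieldnames}")
        for row in reader:
            label = (row[label_column] or "").strip()
            if label:  # ignore rows with an empty or missing label
                counts[label] += 1
    return counts


if __name__ == "__main__":
    raw_file = Path("data/raw/adult.csv")
    if raw_file.is_file():
        for label, count in label_counts(raw_file).most_common():
            print(f"{label}: {count}")
    else:
        print(f"{raw_file} not found; download the dataset first.")
```

The share of the most frequent label is also the accuracy a trivial majority-class predictor would reach, which is a useful reference point when reading model results.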
### 3) Reproduce the pipeline (DVC)
```bash
dvc repro
```
- **Stages** and dependencies live in `dvc.yaml`; `dvc repro` runs data prep → feature building → training → evaluation.
- Metrics and plots are logged in `dvclive/`.
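The logged metrics can also be inspected programmatically after a run. The sketch below assumes DVCLive's default summary file, `dvclive/metrics.json`. The helper names `flatten` and `load_metrics` are hypothetical, and the metric keys you see depend on what the training script actually logs.

```python
import json
from pathlib import Path


def flatten(nested, prefix=""):
    """Flatten nested dicts into {"a/b": value} pairs for easy printing."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}/{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def load_metrics(path="dvclive/metrics.json"):
    """Load the metrics summary; return an empty dict if the file is absent."""
    metrics_file = Path(path)
    if not metrics_file.is_file():
        return {}
    with metrics_file.open(encoding="utf-8") as handle:
        return flatten(json.load(handle))


if __name__ == "__main__":
    # Assumed location; prints nothing if no run has written metrics yet.
    for name, value in sorted(load_metrics().items()):
        print(f"{name}: {value}")
```

If the file is registered as a metrics file in `dvc.yaml`, `dvc metrics show` prints the same values from the command line.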
### 4) Tweak parameters & rerun
Edit **`params.yaml`** to change model settings, then:
```bash
dvc repro
```
### 5) Track experiments (MLflow)
```bash
mlflow ui
```
Open [http://127.0.0.1:5000](http://127.0.0.1:5000) to explore experiments.
### 6) Run the dashboard
```bash
streamlit run dashboard.py
```
---
## Pipeline overview
- **Data prep**: clean & split the Adult dataset.
- **Feature engineering**: encode categoricals, scale numerics.
- **Model training**: Logistic Regression, RandomForest, GradientBoosting, etc.
- **Evaluation**: accuracy, precision, recall, F1, ROC-AUC, confusion matrix.
- **Experiment tracking**: DVC, DVCLive, MLflow.
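
To make the evaluation step concrete, here is a minimal, dependency-free sketch of how accuracy, precision, recall, and F1 follow from the confusion-matrix counts. It is not this project's evaluation code: the function names are hypothetical, the positive class (`1`, standing for income > $50K) is an assumption, and ROC-AUC is left out because it needs predicted scores rather than hard labels. A library such as scikit-learn provides equivalents of all of these.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary task with the given positive label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1; undefined ratios become 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only, not results from this project.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print(confusion_counts(y_true, y_pred))
    print(classification_metrics(y_true, y_pred))
```

Because the classes in this task are imbalanced, precision, recall, and F1 for the > $50K class are usually more informative than accuracy alone.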
---
## Example results
Typical accuracy: **80–86%** (depends on preprocessing and model).
---
## Makefile shortcuts
```bash
make train # Run training
make clean # Clean temp artifacts
make dashboard # Launch Streamlit dashboard
```
---
## References
- Adult (Census Income) dataset (UCI)
- Kaggle: Adult Census Income
- Cookiecutter Data Science
---
## Tips
- If only `params.yaml` changes, rerun `dvc repro`; DVC re-executes only the stages whose declared parameters or dependencies changed.
- Use `dvc commit && dvc push` to sync data to remote storage.
- Use Streamlit for fast visual validation.