https://github.com/friendotjava/income-prediction

A machine learning project that classifies whether a person's income is above $50K or at most $50K using several models, integrated with a DVC pipeline.

# Income Prediction (Adult Census)

> A modular machine-learning project that predicts whether a person’s annual income is **> $50K** or **≤ $50K**, built with a clean Cookiecutter Data Science structure, tracked with **DVC** (and DVCLive), and instrumented for **MLflow** experiment logging. A small **Streamlit** dashboard is included for quick exploration.

---

Try it here: https://income-prediction1.streamlit.app/

## 📌 Project goals

- Train solid baseline & boosted tree models for the Adult/Census Income task (binary classification >$50K).
- Keep work **reproducible** (DVC pipelines + parameters), **trackable** (DVCLive/MLflow), and **organized** (Cookiecutter DS layout).
- Provide a minimal **dashboard** to poke the model and visualize results.

---

## 🗂 Repository structure

```
├── data/                     # raw/ → interim/ → processed/ (DVC-managed)
├── docs/                     # (optional) project docs
├── dvclive/                  # live metrics/artifacts from runs
├── income_classification/
│   ├── __init__.py
│   ├── config.py
│   ├── dataset.py            # data download/prepare helpers
│   ├── features.py           # feature engineering
│   └── modeling/
│       ├── __init__.py
│       ├── predict.py        # inference script
│       └── train.py          # training script
├── notebooks/                # EDA & scratch work
├── references/               # data dictionary, notes, etc.
├── dashboard.py              # Streamlit mini app
├── dvc.yaml                  # DVC pipeline (stages & deps)
├── dvc.lock                  # DVC lockfile (auto-generated)
├── params.yaml               # central hyperparams & config
├── requirements.txt          # Python dependencies
├── Makefile                  # convenience commands
└── README.md
```

---

## 📦 Dataset

This project uses the **Adult (Census Income)** dataset: **48,842** rows, **14** features, binary target (> $50K).
You can obtain it from UCI or Kaggle.

> Place raw files under `data/raw/`.

---

## 🛠️ Quickstart

### 1) Setup environment

```bash
git clone https://github.com/FrienDotJava/income-prediction.git
cd income-prediction

python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate

pip install -r requirements.txt
```

### 2) Get the data

Download the Adult/Census Income data and put it here:

```
data/
└── raw/
└── adult.csv
```
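Before running the pipeline, it can help to confirm that the file parses and that both income classes are present. The sketch below is not part of the repository: `summarize_labels` is a hypothetical helper, and it assumes the CSV has a header row with a target column named `income` (the raw UCI files ship without a header, so adjust accordingly).

```python
import csv
from collections import Counter


def summarize_labels(lines, target_column="income"):
    """Return (row_count, label_counts) for CSV text that has a header row.

    `lines` is any iterable of CSV lines, e.g. an open file handle.
    `target_column` is an assumed column name, not confirmed by the repo.
    """
    # skipinitialspace handles files that put a space after each comma.
    reader = csv.DictReader(lines, skipinitialspace=True)
    fieldnames = reader.fieldnames or []
    if target_column not in fieldnames:
        raise KeyError(f"Column {target_column!r} not found in {fieldnames}")

    counts = Counter()
    for row in reader:
        # Short rows yield None for missing fields; count them as "".
        counts[(row[target_column] or "").strip()] += 1
    return sum(counts.values()), dict(counts)
```

To run it against the real file, pass an open handle: `with open("data/raw/adult.csv", newline="") as fh: print(summarize_labels(fh))`.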

### 3) Reproduce the pipeline (DVC)

```bash
dvc repro
```

- **Stages** and dependencies live in `dvc.yaml`; `dvc repro` runs data prep → feature building → training → evaluation.
- Metrics and plots are logged in `dvclive/`.
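How the training script writes to `dvclive/` is not shown here. As a rough sketch, DVCLive logging usually looks like the following. The helper names `log_history` and `run_with_dvclive`, the parameter names, and the shape of `history` are all assumptions made for illustration; only `Live`, `log_param`, `log_metric`, and `next_step` come from DVCLive itself.

```python
def log_history(live, params, history):
    """Log hyperparameters once, then one group of metrics per step.

    `live` is a dvclive.Live instance (or any object with the same
    log_param / log_metric / next_step methods).
    `history` is a list of {metric_name: value} dicts, one per step.
    """
    for name, value in params.items():
        live.log_param(name, value)
    for step_metrics in history:
        for name, value in step_metrics.items():
            live.log_metric(name, value)
        live.next_step()  # advance DVCLive's step counter


def run_with_dvclive(params, history, out_dir="dvclive"):
    """Open a Live session that writes under `out_dir` and log everything."""
    # Imported lazily so log_history stays usable without dvclive installed.
    from dvclive import Live

    with Live(dir=out_dir) as live:
        log_history(live, params, history)
```

Keeping the logging loop separate from the `Live` session makes it easy to exercise with a stand-in object in unit tests.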

### 4) Tweak parameters & rerun

Edit **`params.yaml`** to change model settings, then:

```bash
dvc repro
```
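The README does not show how the stages read `params.yaml`. One common pattern, sketched here under that caveat, is to load the file with PyYAML and look values up by dotted key. `load_params`, `get_param`, and the `train.n_estimators` key are hypothetical; check the actual key names in the repository's `params.yaml`.

```python
from pathlib import Path


def load_params(path="params.yaml"):
    """Parse a YAML parameters file into a (possibly nested) dict."""
    # PyYAML is a third-party package; imported lazily so get_param
    # below can be used and tested without it.
    import yaml

    with Path(path).open(encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}


def get_param(params, dotted_key, default=None):
    """Look up a nested value such as "train.n_estimators".

    Returns `default` if any part of the path is missing.
    """
    node = params
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```

A training script could then call `get_param(load_params(), "train.n_estimators", 100)` so that a missing key falls back to a default instead of raising.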

### 5) Track experiments (MLflow)

```bash
mlflow ui
```

Open [http://127.0.0.1:5000](http://127.0.0.1:5000) to explore experiments.
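For runs to show up in the UI, the training code has to log them. A minimal sketch of what that might look like follows; the experiment name, `flatten`, and `log_run` are illustrative assumptions, while `set_experiment`, `start_run`, `log_params`, and `log_metrics` are standard MLflow calls. Nested parameters are flattened first because MLflow stores parameters as flat key/value pairs.

```python
def flatten(params, prefix=""):
    """Flatten a nested dict into dotted keys: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def log_run(params, metrics, experiment="income-prediction"):
    """Record one MLflow run with the given parameters and metrics.

    `metrics` must map names to numeric values.
    """
    # Imported lazily so flatten() works without MLflow installed.
    import mlflow

    mlflow.set_experiment(experiment)  # hypothetical experiment name
    with mlflow.start_run():
        mlflow.log_params(flatten(params))
        mlflow.log_metrics(metrics)
```

Run `mlflow ui` from the same directory (or with the same tracking URI) that the training script used, otherwise the UI may start with an empty experiment list.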

### 6) Run the dashboard

```bash
streamlit run dashboard.py
```

---

## 🧪 Pipeline overview

- **Data prep**: clean & split the Adult dataset.
- **Feature engineering**: encode categoricals, scale numerics.
- **Model training**: Logistic Regression, RandomForest, GradientBoosting, etc.
- **Evaluation**: accuracy, precision, recall, F1, ROC-AUC, confusion matrix.
- **Experiment tracking**: DVC, DVCLive, MLflow.
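To make the evaluation step concrete, the threshold-based metrics listed above all derive from the four confusion-matrix counts. The following dependency-free sketch shows the arithmetic; it is not the project's evaluation code, and in practice `sklearn.metrics` provides equivalents (plus ROC-AUC, which needs predicted scores rather than hard labels and is not covered here).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from hard predictions.

    Ratios with a zero denominator are reported as 0.0.
    """
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Because the > $50K class is the minority, precision, recall, and F1 for that class are worth reading alongside accuracy.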

---

## 📈 Example results

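Typical accuracy: **80–86%**, depending on preprocessing and model. For context, roughly 76% of rows in the Adult dataset belong to the ≤ $50K class, so always predicting the majority class already reaches about that accuracy.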

---

## ▢️ Makefile shortcuts

```bash
make train # Run training
make clean # Clean temp artifacts
make dashboard # Launch Streamlit dashboard
```

---

## 📚 References

- Adult (Census Income) dataset (UCI)
- Kaggle: Adult Census Income
- Cookiecutter Data Science

---

## 💡 Tips

- If only `params.yaml` changes, rerun `dvc repro`; DVC re-executes only the stages whose declared parameters or dependencies changed.
- Use `dvc commit && dvc push` to sync data to remote storage; this requires a DVC remote to be configured first.
- Use the Streamlit dashboard to spot-check predictions visually.