https://github.com/mohsinraza2999/generous-tipper
A production-level, modular data science project that aims to predict generous tippers for taxi drivers.
- Host: GitHub
- URL: https://github.com/mohsinraza2999/generous-tipper
- Owner: mohsinraza2999
- Created: 2026-02-08T14:59:13.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-02-11T19:08:35.000Z (4 months ago)
- Last Synced: 2026-02-11T21:49:30.659Z (4 months ago)
- Topics: backend-development, ci-pipeline, data-analysis, data-cleaning-and-preprocessing, docker, exploratory-data-analysis, fastapi, feature-engineering, front-end, hypothesis-testing, logistic-regression, randon-forest, understanding-business-problem, xgboost-classifier
- Language: Jupyter Notebook
- Homepage:
- Size: 1.01 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Generous Tip Giver Prediction
## Problem
Taxi ride-hailing platforms rely heavily on tips as a key component of driver income, yet passenger tipping behavior is highly variable and difficult to predict. This unpredictability limits the platform's ability to optimize driver-rider matching, incentives, and service quality. Large volumes of trip, fare, temporal, and behavioral data are generated but remain underutilized for tipping prediction. A data science and machine learning approach can identify patterns that distinguish generous tippers from others. Acting on those patterns can ultimately lead to higher service quality, better retention, and increased platform efficiency.
## Solution
Built a full ML pipeline including:
- Data ingestion & cleaning
- Feature engineering
- Model training (XGBoost, Random Forest, Logistic Regression)
- FastAPI deployment
- Dockerized application
## Dataset
* **Type:** Yellow Taxi Trip dataset from Kaggle
* **Target:** Generous Tipper
* **Features:** Eighteen numerical and encoded categorical attributes
* **Size:** 22,700 observations
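
The README does not state how the Generous Tipper target is derived. As a minimal sketch, assuming a trip counts as generous when the tip is at least 20% of the fare, the label could be computed as follows. The threshold and the field names `tip_amount` and `fare_amount` are illustrative assumptions, not taken from the repository.

```python
from typing import Mapping

# Hypothetical threshold: a tip of at least 20% of the fare counts as generous.
GENEROUS_THRESHOLD = 0.20


def is_generous(trip: Mapping[str, float], threshold: float = GENEROUS_THRESHOLD) -> int:
    """Return 1 if the tip-to-fare ratio meets the threshold, else 0."""
    fare = trip["fare_amount"]
    if fare <= 0:
        # Zero or negative fares have no meaningful tip ratio.
        return 0
    return int(trip["tip_amount"] / fare >= threshold)


# Illustrative rows only; these are not records from the dataset.
trips = [
    {"fare_amount": 10.0, "tip_amount": 2.5},
    {"fare_amount": 20.0, "tip_amount": 1.0},
    {"fare_amount": 0.0, "tip_amount": 0.0},
]
labels = [is_generous(t) for t in trips]
print(labels)
```

Turning the tip ratio into a binary label is what makes the task a classification problem suited to the three models listed in the Solution section.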
## Tech Stack
Python, Pandas, Scikit-learn, XGBoost, FastAPI, Docker
## Architecture
```text
generous-tipper/
│
├── data/ # raw & processed data
├── config/ # data & training configurations
├── frontend/ # Core frontend logic with dockerization
├── notebooks/ # Training and data cleaning notebooks
├── src/ # Core data, training and backend pipeline logic
├── tests/ # Basic unit tests of data, training, api pipelines
├── docker-compose.yaml # dockerizing back and frontend with health check every 30 seconds
├── Dockerfile # multi-step dockerization for clean containerization
├── pyproject.toml
├── README.md
└── LICENSE
```
---
## Quick Start
```bash
git clone https://github.com/mohsinraza2999/generous-tipper.git
cd generous-tipper
python src/cli.py preprocess
python src/cli.py train
python src/cli.py route
```
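
The three commands above suggest a CLI with one subcommand per pipeline stage. The repository's actual `src/cli.py` is not reproduced here, so the following is only a sketch of how such an entry point could be laid out with `argparse`. The handler functions and their return strings are hypothetical placeholders.

```python
import argparse
from typing import Callable, Dict, List, Optional


# Hypothetical stage handlers; the real project would call its own pipeline code.
def preprocess() -> str:
    return "preprocess: clean raw data and write processed data"


def train() -> str:
    return "train: fit models using the training configuration"


def route() -> str:
    return "route: start the API server"


COMMANDS: Dict[str, Callable[[], str]] = {
    "preprocess": preprocess,
    "train": train,
    "route": route,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cli.py", description="Pipeline entry point")
    parser.add_argument("command", choices=sorted(COMMANDS), help="pipeline stage to run")
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    """Parse the command name and dispatch to the matching handler."""
    args = build_parser().parse_args(argv)
    return COMMANDS[args.command]()


# Passing the argument list explicitly mirrors `python src/cli.py train`.
print(main(["train"]))
```

Keeping the dispatch table separate from the parser means a new stage only needs one new entry in `COMMANDS`.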
---
## Making Predictions
```bash
python src/cli.py route
```
This starts the backend only. Open the Swagger UI at:
```text
http://localhost:8000/docs
```
Example response:
```json
{
"prediction": "generous",
"processed_at": "10-02-2026T07:30:21S",
"latency_ms": 0.04
}
```
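
To illustrate where fields such as `processed_at` and `latency_ms` could come from, here is a minimal sketch that times a prediction call and assembles a payload with the same keys. The `predict_label` stub, the `tip_ratio` feature, the `"not generous"` label, and the ISO 8601 timestamp are all assumptions; the service's actual model call and timestamp format may differ.

```python
import time
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Mapping


def predict_label(features: Mapping[str, float]) -> str:
    """Hypothetical stand-in for the trained model's inference call."""
    return "generous" if features.get("tip_ratio", 0.0) >= 0.2 else "not generous"


def build_response(
    features: Mapping[str, float],
    predictor: Callable[[Mapping[str, float]], str] = predict_label,
) -> Dict[str, Any]:
    """Run the predictor and wrap its label with timing metadata."""
    start = time.perf_counter()
    label = predictor(features)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {
        "prediction": label,
        # Assumed format: UTC timestamp in ISO 8601, second precision.
        "processed_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "latency_ms": round(latency_ms, 2),
    }


response = build_response({"tip_ratio": 0.25})
print(response)
```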
---
## Testing
Run all unit and integration tests:
```bash
pip install pytest
pytest tests/
```
Tests cover:
* Data preprocessing pipeline
* API routes
* Model inference behavior
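
As an illustration of the kind of check such a suite might contain, the sketch below tests a hypothetical cleaning step in pytest style. The `drop_invalid_trips` function and its rule (discard trips with a non-positive fare) are assumptions for demonstration, not the project's actual preprocessing logic.

```python
from typing import Dict, List

Trip = Dict[str, float]


def drop_invalid_trips(trips: List[Trip]) -> List[Trip]:
    """Hypothetical cleaning step: keep only trips with a positive fare."""
    return [t for t in trips if t.get("fare_amount", 0.0) > 0]


def test_drop_invalid_trips_removes_non_positive_fares() -> None:
    trips = [
        {"fare_amount": 12.5, "tip_amount": 3.0},
        {"fare_amount": 0.0, "tip_amount": 0.0},
        {"fare_amount": -4.0, "tip_amount": 0.0},
    ]
    cleaned = drop_invalid_trips(trips)
    assert cleaned == [{"fare_amount": 12.5, "tip_amount": 3.0}]


def test_drop_invalid_trips_keeps_input_unchanged() -> None:
    trips = [{"fare_amount": 0.0, "tip_amount": 0.0}]
    drop_invalid_trips(trips)
    # The function returns a new list and must not mutate its argument.
    assert len(trips) == 1
```

Saved in a file under `tests/`, pytest would discover both functions by their `test_` prefix.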
---
## Docker Build
Build and run the backend and frontend containers. A health check runs every 30 seconds.
```bash
docker-compose up --build
```
1. Open the full application (frontend and backend) in a browser:
```text
http://localhost:3000
```
2. For the backend only, open the Swagger UI:
```text
http://localhost:8000/docs
```
Example response:
```json
{
"prediction": "generous",
"processed_at": "10-02-2026T07:30:21S",
"latency_ms": 0.04
}
```
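
The 30-second health check mentioned in this section is configured in `docker-compose.yaml`, which is not reproduced here. As a rough sketch of what such a probe does, the function below requests a URL and reports whether it answered with HTTP 200. Probing `/docs` is an assumption; the project may expose a dedicated health route.

```python
import urllib.error
import urllib.request
from typing import Any, Callable


def is_healthy(
    url: str,
    timeout: float = 5.0,
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> bool:
    """Return True if the URL responds with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


# Example call once the containers are running (assumed URL):
# is_healthy("http://localhost:8000/docs")
```

The `opener` parameter exists only so the probe can be exercised without a running server.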
---
## Configuration
* All hyperparameters stored in YAML files
* Data paths, training parameters, and inference behavior configurable
* Environment-agnostic (local or containerized)
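
A minimal sketch of the config-driven idea: defaults live in one place and a loaded config overrides them, so changing a hyperparameter needs no code change. To stay dependency-free, the example uses a plain dict where the project would parse a YAML file. Every key and value shown is a hypothetical placeholder.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class TrainingConfig:
    # Hypothetical parameters; the project's YAML files define the real ones.
    model: str = "xgboost"
    n_estimators: int = 100
    test_size: float = 0.2
    random_state: int = 42


def load_config(raw: Dict[str, Any]) -> TrainingConfig:
    """Build a config from parsed settings, rejecting unknown keys early."""
    allowed = {f.name for f in fields(TrainingConfig)}
    unknown = set(raw) - allowed
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    return TrainingConfig(**raw)


# Stand-in for the dict a YAML parser would return.
config = load_config({"model": "random_forest", "n_estimators": 300})
print(config)
```

Rejecting unknown keys catches typos in a config file before a long training run starts.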
---
## Design Decisions & Trade-offs
* **Why Machine Learning?**
Because tree-based models perform well on tabular data, neural networks were not used. The focus is instead on practicing model abstraction, extensibility, and deployment workflows.
* **Why config-driven pipelines?**
To separate experimentation from code changes and improve reproducibility.
* **Why both CLI and scripts?**
CLI serves developers; scripts support automation and CI.
---
## Future Improvements
* Model monitoring & drift detection
* Cloud deployment
---
## Key Learnings
* ML systems should be designed as maintainable software
* Testing pipelines prevents silent failures
* Separation of training and inference is critical
---
## CI & Automation
* GitHub Actions pipeline:
  * Runs tests on push
  * Ensures build stability
* Docker build validation included
---
## Contact
**Author:** Mohsin Raza
**Target Role:** Machine Learning Engineer / AI Engineer
**GitHub:** [github/mohsinraza2999](https://github.com/mohsinraza2999)
**LinkedIn:** *[linkedin/mohsin-raza](https://www.linkedin.com/in/mohsin-raza-b7ab73328)*