https://github.com/harshitaphadtare/gopredict
- Host: GitHub
- URL: https://github.com/harshitaphadtare/gopredict
- Owner: harshitaphadtare
- Created: 2025-09-21T13:38:56.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-10-01T15:14:03.000Z (4 months ago)
- Last Synced: 2025-10-01T17:24:30.595Z (4 months ago)
- Topics: full-stack-application, machine-learning
- Language: Jupyter Notebook
- Homepage:
- Size: 30.4 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# GoPredict - Machine Learning Pipeline for Trip Duration Prediction
A comprehensive machine learning pipeline for predicting trip durations using various regression models, feature engineering, and hyperparameter optimization.
Medium post - https://medium.com/@hphadtare02/how-machine-learning-predicts-trip-duration-just-like-uber-zomato-91f7db6e9ce9
## 📁 Project Structure
```
GoPredict/
├── main.py                             # Main runner script
├── config.py                           # Project configuration
├── requirements.txt                    # Python dependencies
├── README.md                           # This file
│
├── data/                               # Data directory
│   ├── raw/                            # Raw data files
│   │   ├── train.csv                   # Training data
│   │   └── test.csv                    # Test data
│   ├── processed/                      # Processed data files
│   │   ├── feature_engineered_train.csv
│   │   ├── feature_engineered_test.csv
│   │   └── gmapsdata/                  # Google Maps data
│   └── external/                       # External data sources
│       └── precipitation.csv           # Weather data
│
├── src/                                # Source code
│   ├── model/                          # Model-related modules
│   │   ├── models.py                   # All ML models and pipeline
│   │   ├── evaluation.py               # Model evaluation functions
│   │   └── save_models.py              # Model persistence
│   ├── features/                       # Feature engineering modules
│   │   ├── distance.py                 # Distance calculations
│   │   ├── geolocation.py              # Geographic features
│   │   ├── gmaps.py                    # Google Maps integration
│   │   ├── precipitation.py            # Weather features
│   │   └── time.py                     # Time-based features
│   ├── feature_pipe.py                 # Feature engineering pipeline
│   ├── data_preprocessing.py           # Data preprocessing
│   └── complete_pipeline_example.py    # Usage examples
│
├── notebooks/                          # Jupyter notebooks
│   ├── 01_EDA.ipynb                    # Exploratory Data Analysis
│   ├── 02_Feature_Engineering.ipynb    # Feature engineering
│   ├── 03_Model_Training.ipynb         # Model training
│   ├── figures/                        # Generated plots
│   └── gmaps/                          # Interactive maps
│
├── saved_models/                       # Trained models (auto-created)
├── output/                             # Predictions and submissions (auto-created)
└── logs/                               # Log files (auto-created)
```
## 🚀 Quick Start
### 1. Installation
```bash
# Clone the repository
git clone https://github.com/harshitaphadtare/gopredict.git
cd gopredict
# Install dependencies
pip install -r requirements.txt
# Create necessary directories
mkdir -p logs output saved_models
```
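The `requirements.txt` in the repository is authoritative. As a rough guide, the core libraries implied by the models and notebooks above look something like the list below (versions omitted; the package used for the neural network model is not named in this README, so it is not listed):

```
pandas
numpy
scikit-learn
xgboost
matplotlib
jupyter
```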
### 2. Data Preparation
Ensure you have the following data files in place:
- `data/raw/train.csv` - Training data
- `data/raw/test.csv` - Test data
- `data/external/precipitation.csv` - Weather data
### 3. Run the Pipeline
```bash
# Run COMPLETE end-to-end pipeline (RECOMMENDED)
python main.py --mode complete
# Run complete pipeline with all models (assumes feature engineering is done)
python main.py --mode full
# Train specific models only (assumes feature engineering is done)
python main.py --mode train --models LINREG,RIDGE,XGB
# Make predictions only (assumes feature engineering is done)
python main.py --mode predict --models XGB
# Hyperparameter tuning only (assumes feature engineering is done)
python main.py --mode tune
# Enable XGBoost hyperparameter tuning
python main.py --mode complete --tune-xgb
```
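Internally, `main.py` presumably maps these flags onto the pipeline stages. The sketch below only illustrates that wiring, assuming an argparse-style CLI with the flags documented above; the actual implementation may differ.

```python
# Hypothetical sketch of main.py's argument handling -- not the actual implementation.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="GoPredict trip duration pipeline")
    parser.add_argument("--mode", default="complete",
                        choices=["complete", "full", "train", "predict", "tune"],
                        help="Which pipeline stage(s) to run")
    parser.add_argument("--models", default="LINREG,RIDGE,LASSO,SVR,XGB,RF,NN",
                        help="Comma-separated model codes, e.g. LINREG,RIDGE,XGB")
    parser.add_argument("--tune-xgb", action="store_true",
                        help="Enable XGBoost hyperparameter tuning")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    model_codes = [code.strip() for code in args.models.split(",")]
    print(f"mode={args.mode}, models={model_codes}, tune_xgb={args.tune_xgb}")
```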
## 📊 Available Models
| Model | Code | Description |
| ------------------------- | -------- | ---------------------------------- |
| Linear Regression | `LINREG` | Baseline linear model |
| Ridge Regression | `RIDGE` | Linear with L2 regularization |
| Lasso Regression | `LASSO` | Linear with L1 regularization |
| Support Vector Regression | `SVR` | Support vector machine |
| XGBoost | `XGB` | Gradient boosting (best performer) |
| Random Forest | `RF` | Ensemble of decision trees |
| Neural Network | `NN` | Deep learning model |
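These codes map to estimators inside `src/model/models.py`. As a hedged illustration of what such a registry could look like, assuming scikit-learn and XGBoost estimators with placeholder hyperparameters (the real module may configure them differently):

```python
# Hypothetical code-to-estimator registry -- illustrative only.
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

MODEL_REGISTRY = {
    "LINREG": LinearRegression(),
    "RIDGE":  Ridge(alpha=1.0),
    "LASSO":  Lasso(alpha=0.1),
    "SVR":    SVR(kernel="rbf"),
    "XGB":    XGBRegressor(n_estimators=300, learning_rate=0.1),
    "RF":     RandomForestRegressor(n_estimators=200),
    # "NN" maps to a neural-network regressor; the framework used is not named in this README.
}
```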
## 🎯 Usage
### Simple Pipeline (Default)
```bash
python main.py
```
Runs the complete end-to-end pipeline:
- **Data preprocessing** - Loads and cleans raw data
- **Feature engineering** - Adds distance, time, cluster, and weather features
- **Model training** - Trains all specified models
- **Model evaluation** - Compares model performance
- **Prediction generation** - Creates submission files
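To give a flavour of the distance and time steps (cf. `src/features/distance.py` and `src/features/time.py`), here is a minimal sketch; the column names and exact formulas are assumptions, not the project's actual code.

```python
# Hypothetical feature helpers -- illustrative of the distance/time feature steps.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between pickup and dropoff coordinates."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def add_time_features(df: pd.DataFrame, ts_col: str = "pickup_datetime") -> pd.DataFrame:
    """Derive hour / weekday / month columns from a timestamp column (name assumed)."""
    ts = pd.to_datetime(df[ts_col])
    return df.assign(hour=ts.dt.hour, weekday=ts.dt.dayofweek, month=ts.dt.month)
```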
### Custom Models
```bash
python main.py --models XGB,RF
```
Train only specific models.
### With Hyperparameter Tuning
```bash
python main.py --tune-xgb
```
Enable XGBoost hyperparameter tuning.
## 📈 Output Files
### Predictions
- `output/[model_name]/test_prediction_YYYYMMDD_HHMMSS.csv`
- Ready-to-submit prediction files with timestamps
### Models
- `saved_models/[model_name]_YYYYMMDD_HHMMSS.pkl`
- Trained models with metadata
### Logs
- `logs/main.log` - Complete pipeline execution log
- Detailed progress tracking and metrics
### Visualizations
- `output/prediction_comparison_YYYYMMDD_HHMMSS.png`
- Model comparison plots
- Feature importance plots
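All of these names follow the same `YYYYMMDD_HHMMSS` pattern; a small helper along the following lines could generate them (illustrative, not the project's actual code):

```python
# Illustrative helper for the timestamped output paths described above.
from datetime import datetime
from pathlib import Path

def timestamped_path(directory: str, stem: str, ext: str = "csv") -> Path:
    """E.g. timestamped_path('output/xgb', 'test_prediction') ->
    output/xgb/test_prediction_20251001_151403.csv (directory created if missing)."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return out_dir / f"{stem}_{stamp}.{ext}"
```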
## 🔧 Configuration
Edit `config.py` to customize:
- Model parameters
- Data paths
- Output directories
- Hyperparameter tuning ranges
- Logging settings
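For orientation, a configuration module of this kind typically looks something like the sketch below; the names and values here are illustrative assumptions, not the repository's actual `config.py`.

```python
# Illustrative configuration values -- see the real config.py for the actual settings.
from pathlib import Path

RAW_DATA_DIR       = Path("data/raw")
PROCESSED_DATA_DIR = Path("data/processed")
OUTPUT_DIR         = Path("output")
SAVED_MODELS_DIR   = Path("saved_models")
LOG_FILE           = Path("logs/main.log")

DEFAULT_MODELS = ["LINREG", "RIDGE", "LASSO", "SVR", "XGB", "RF", "NN"]

# Example search space for XGBoost hyperparameter tuning
XGB_PARAM_GRID = {
    "n_estimators": [200, 400, 800],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1, 0.2],
}
```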
## 📝 Usage Examples
### Basic Usage
```python
from src.model.models import run_complete_pipeline
import pandas as pd
# Load data
train_df = pd.read_csv('data/processed/feature_engineered_train.csv')
test_df = pd.read_csv('data/processed/feature_engineered_test.csv')
# Run complete pipeline
results = run_complete_pipeline(
    train_df=train_df,
    test_df=test_df,
    models_to_run=['LINREG', 'RIDGE', 'XGB'],
    tune_xgb=True,
    create_submission=True
)
```
### Individual Components
```python
from src.model.models import run_regression_models, predict_duration, to_submission
# Train models
models = run_regression_models(train_df, ['XGB', 'RF'])
# Make predictions
predictions = predict_duration(models['XGBoost'], test_df)
# Create submission
submission_file = to_submission(predictions)
```
### Hyperparameter Tuning
```python
from src.model.models import hyperparameter_tuning_xgb
# Tune XGBoost
best_model, best_params, best_rmse = hyperparameter_tuning_xgb(train_df)
print(f"Best RMSE: {best_rmse}")
print(f"Best parameters: {best_params}")
```
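For context, a tuning routine like `hyperparameter_tuning_xgb` is typically built on a cross-validated search. The sketch below uses scikit-learn's `RandomizedSearchCV`; the actual search strategy and parameter space used in this project are assumptions here.

```python
# Hypothetical tuning sketch -- illustrates the idea, not the project's implementation.
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

def tune_xgb(X_train, y_train):
    param_distributions = {
        "n_estimators": [200, 400, 800],
        "max_depth": [4, 6, 8],
        "learning_rate": [0.05, 0.1, 0.2],
        "subsample": [0.7, 0.9, 1.0],
    }
    search = RandomizedSearchCV(
        XGBRegressor(objective="reg:squarederror"),
        param_distributions,
        n_iter=20,
        scoring="neg_root_mean_squared_error",
        cv=3,
        random_state=42,
    )
    search.fit(X_train, y_train)
    # best_score_ is negative RMSE because of the scoring convention above
    return search.best_estimator_, search.best_params_, -search.best_score_
```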
## 🎨 Features
### Data Processing
- **Feature Engineering**: Distance calculations, time features, weather data
- **Normalization**: Custom normalization for different feature types
- **Data Validation**: Automatic data quality checks
### Model Training
- **Multiple Algorithms**: 7 different regression models
- **Hyperparameter Tuning**: Automated XGBoost optimization
- **Cross-Validation**: Built-in validation splits
- **Progress Tracking**: Detailed logging, with each pipeline stage bracketed by start/end markers ("sandwich" format)
### Evaluation
- **Comprehensive Metrics**: RMSE, MAE, R², MAPE
- **Visual Comparisons**: Histogram comparisons, feature importance
- **Model Persistence**: Save and load trained models
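These metrics are standard; with scikit-learn they can be computed as in the sketch below (the evaluation module's actual function names may differ).

```python
# Illustrative metric computation for validation predictions (targets assumed positive).
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def regression_report(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return {
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE":  float(mean_absolute_error(y_true, y_pred)),
        "R2":   float(r2_score(y_true, y_pred)),
        "MAPE": float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100),
    }
```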
### Output
- **Submission Files**: Ready-to-submit CSV files
- **Visualizations**: Plots and charts for analysis
- **Logging**: Complete audit trail
## 🐛 Troubleshooting
### Common Issues
1. **Missing Data Files**
```
FileNotFoundError: Data file not found
```
Solution: Ensure all required data files are in the correct directories
2. **Import Errors**
```
ModuleNotFoundError: No module named 'xgboost'
```
Solution: Install missing dependencies: `pip install -r requirements.txt`
3. **Memory Issues**
```
MemoryError: Unable to allocate array
```
Solution: Reduce batch size or use fewer models
### Getting Help
- Check logs in `logs/main.log` for detailed error messages
- Verify data files are in correct format and location
- Ensure all dependencies are installed correctly
## 📊 Performance
Typical model performance on validation set:
- **XGBoost**: ~400-450 RMSE (best performer)
- **Random Forest**: ~420-470 RMSE
- **Linear Models**: ~450-500 RMSE
- **Neural Network**: ~430-480 RMSE
## 🔮 Future Enhancements
- [ ] Automated feature selection
- [ ] Real-time prediction API
- [ ] Model monitoring dashboard
- [ ] A/B testing framework
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🤝 Contributing
Please read [CONTRIBUTING.md](CONTRIBUTING.md). By participating, you agree to abide by our [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) and report vulnerabilities per [SECURITY.md](SECURITY.md).
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## 📞 Support
For questions or issues, please:
1. Check the logs first
2. Review this documentation