{"id":31785571,"url":"https://github.com/harshitaphadtare/gopredict","last_synced_at":"2025-10-10T11:59:04.016Z","repository":{"id":316326731,"uuid":"1061218756","full_name":"harshitaphadtare/GoPredict","owner":"harshitaphadtare","description":null,"archived":false,"fork":false,"pushed_at":"2025-10-01T15:14:03.000Z","size":31830,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-01T17:24:30.595Z","etag":null,"topics":["full-stack-application","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/harshitaphadtare.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-21T13:38:56.000Z","updated_at":"2025-10-01T15:14:09.000Z","dependencies_parsed_at":"2025-09-24T01:23:29.658Z","dependency_job_id":null,"html_url":"https://github.com/harshitaphadtare/GoPredict","commit_stats":null,"previous_names":["harshitaphadtare/gopredict"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/harshitaphadtare/GoPredict","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harshitaphadtare%2FGoPredict","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harshitaphadtare%2FGoPredict/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harshitaphadtare%2FGoPredict/releases","manifests_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/harshitaphadtare%2FGoPredict/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/harshitaphadtare","download_url":"https://codeload.github.com/harshitaphadtare/GoPredict/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harshitaphadtare%2FGoPredict/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279003726,"owners_count":26083610,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-10T02:00:06.843Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["full-stack-application","machine-learning"],"created_at":"2025-10-10T11:59:01.601Z","updated_at":"2025-10-10T11:59:04.006Z","avatar_url":"https://github.com/harshitaphadtare.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GoPredict - Machine Learning Pipeline for Trip Duration Prediction\n\nA comprehensive machine learning pipeline for predicting trip durations using various regression models, feature engineering, and hyperparameter optimization.\n\nMedium post - https://medium.com/@hphadtare02/how-machine-learning-predicts-trip-duration-just-like-uber-zomato-91f7db6e9ce9\n\n## 📁 Project Structure\n\n```\nGoPredict/\n├── main.py                          # Main runner script\n├── config.py                        # 
Project configuration\n├── requirements.txt                  # Python dependencies\n├── README.md                        # This file\n│\n├── data/                            # Data directory\n│   ├── raw/                         # Raw data files\n│   │   ├── train.csv               # Training data\n│   │   └── test.csv                # Test data\n│   ├── processed/                   # Processed data files\n│   │   ├── feature_engineered_train.csv\n│   │   ├── feature_engineered_test.csv\n│   │   └── gmapsdata/              # Google Maps data\n│   └── external/                    # External data sources\n│       └── precipitation.csv       # Weather data\n│\n├── src/                            # Source code\n│   ├── model/                      # Model-related modules\n│   │   ├── models.py              # All ML models and pipeline\n│   │   ├── evaluation.py          # Model evaluation functions\n│   │   └── save_models.py         # Model persistence\n│   ├── features/                   # Feature engineering modules\n│   │   ├── distance.py            # Distance calculations\n│   │   ├── geolocation.py         # Geographic features\n│   │   ├── gmaps.py               # Google Maps integration\n│   │   ├── precipitation.py       # Weather features\n│   │   └── time.py                # Time-based features\n│   ├── feature_pipe.py            # Feature engineering pipeline\n│   ├── data_preprocessing.py      # Data preprocessing\n│   └── complete_pipeline_example.py # Usage examples\n│\n├── notebooks/                      # Jupyter notebooks\n│   ├── 01_EDA.ipynb               # Exploratory Data Analysis\n│   ├── 02_Feature_Engineering.ipynb # Feature engineering\n│   ├── 03_Model_Training.ipynb    # Model training\n│   ├── figures/                   # Generated plots\n│   └── gmaps/                     # Interactive maps\n│\n├── saved_models/                   # Trained models (auto-created)\n├── output/                         # Predictions and submissions 
(auto-created)\n└── logs/                          # Log files (auto-created)\n```\n\n## 🚀 Quick Start\n\n### 1. Installation\n\n```bash\n# Clone the repository\ngit clone \u003cyour-repo-url\u003e\ncd GoPredict\n\n# Install dependencies\npip install -r requirements.txt\n\n# Create necessary directories\nmkdir -p logs output saved_models\n```\n\n### 2. Data Preparation\n\nEnsure you have the following data files in place:\n\n- `data/raw/train.csv` - Training data\n- `data/raw/test.csv` - Test data\n- `data/external/precipitation.csv` - Weather data\n\n### 3. Run the Pipeline\n\n```bash\n# Run COMPLETE end-to-end pipeline (RECOMMENDED)\npython main.py --mode complete\n\n# Run complete pipeline with all models (assumes feature engineering is done)\npython main.py --mode full\n\n# Train specific models only (assumes feature engineering is done)\npython main.py --mode train --models LINREG,RIDGE,XGB\n\n# Make predictions only (assumes feature engineering is done)\npython main.py --mode predict --models XGB\n\n# Hyperparameter tuning only (assumes feature engineering is done)\npython main.py --mode tune\n\n# Enable XGBoost hyperparameter tuning\npython main.py --mode complete --tune-xgb\n```\n\n## 📊 Available Models\n\n| Model                     | Code     | Description                        |\n| ------------------------- | -------- | ---------------------------------- |\n| Linear Regression         | `LINREG` | Baseline linear model              |\n| Ridge Regression          | `RIDGE`  | Linear with L2 regularization      |\n| Lasso Regression          | `LASSO`  | Linear with L1 regularization      |\n| Support Vector Regression | `SVR`    | Support vector machine             |\n| XGBoost                   | `XGB`    | Gradient boosting (best performer) |\n| Random Forest             | `RF`     | Ensemble of decision trees         |\n| Neural Network            | `NN`     | Deep learning model                |\n\n## 🎯 Usage\n\n### Simple Pipeline 
(Default)\n\n```bash\npython main.py\n```\n\nRuns the complete end-to-end pipeline:\n\n- **Data preprocessing** - Loads and cleans raw data\n- **Feature engineering** - Adds distance, time, cluster, and weather features\n- **Model training** - Trains all specified models\n- **Model evaluation** - Compares model performance\n- **Prediction generation** - Creates submission files\n\n### Custom Models\n\n```bash\npython main.py --models XGB,RF\n```\n\nTrain only specific models.\n\n### With Hyperparameter Tuning\n\n```bash\npython main.py --tune-xgb\n```\n\nEnable XGBoost hyperparameter tuning.\n\n## 📈 Output Files\n\n### Predictions\n\n- `output/[model_name]/test_prediction_YYYYMMDD_HHMMSS.csv`\n- Ready-to-submit prediction files with timestamps\n\n### Models\n\n- `saved_models/[model_name]_YYYYMMDD_HHMMSS.pkl`\n- Trained models with metadata\n\n### Logs\n\n- `logs/main.log` - Complete pipeline execution log\n- Detailed progress tracking and metrics\n\n### Visualizations\n\n- `output/prediction_comparison_YYYYMMDD_HHMMSS.png`\n- Model comparison plots\n- Feature importance plots\n\n## 🔧 Configuration\n\nEdit `config.py` to customize:\n\n- Model parameters\n- Data paths\n- Output directories\n- Hyperparameter tuning ranges\n- Logging settings\n\n## 📝 Usage Examples\n\n### Basic Usage\n\n```python\nfrom src.model.models import run_complete_pipeline\nimport pandas as pd\n\n# Load data\ntrain_df = pd.read_csv('data/processed/feature_engineered_train.csv')\ntest_df = pd.read_csv('data/processed/feature_engineered_test.csv')\n\n# Run complete pipeline\nresults = run_complete_pipeline(\n    train_df=train_df,\n    test_df=test_df,\n    models_to_run=['LINREG', 'RIDGE', 'XGB'],\n    tune_xgb=True,\n    create_submission=True\n)\n```\n\n### Individual Components\n\n```python\nfrom src.model.models import run_regression_models, predict_duration, to_submission\n\n# Train models\nmodels = run_regression_models(train_df, ['XGB', 'RF'])\n\n# Make predictions\npredictions = 
predict_duration(models['XGBoost'], test_df)\n\n# Create submission\nsubmission_file = to_submission(predictions)\n```\n\n### Hyperparameter Tuning\n\n```python\nfrom src.model.models import hyperparameter_tuning_xgb\n\n# Tune XGBoost\nbest_model, best_params, best_rmse = hyperparameter_tuning_xgb(train_df)\nprint(f\"Best RMSE: {best_rmse}\")\nprint(f\"Best parameters: {best_params}\")\n```\n\n## 🎨 Features\n\n### Data Processing\n\n- **Feature Engineering**: Distance calculations, time features, weather data\n- **Normalization**: Custom normalization for different feature types\n- **Data Validation**: Automatic data quality checks\n\n### Model Training\n\n- **Multiple Algorithms**: 7 different regression models\n- **Hyperparameter Tuning**: Automated XGBoost optimization\n- **Cross-Validation**: Built-in validation splits\n- **Progress Tracking**: Detailed logging with sandwich format\n\n### Evaluation\n\n- **Comprehensive Metrics**: RMSE, MAE, R², MAPE\n- **Visual Comparisons**: Histogram comparisons, feature importance\n- **Model Persistence**: Save and load trained models\n\n### Output\n\n- **Submission Files**: Ready-to-submit CSV files\n- **Visualizations**: Plots and charts for analysis\n- **Logging**: Complete audit trail\n\n## 🐛 Troubleshooting\n\n### Common Issues\n\n1. **Missing Data Files**\n\n   ```\n   FileNotFoundError: Data file not found\n   ```\n\n   Solution: Ensure all required data files are in the correct directories\n\n2. **Import Errors**\n\n   ```\n   ModuleNotFoundError: No module named 'xgboost'\n   ```\n\n   Solution: Install missing dependencies: `pip install -r requirements.txt`\n\n3. 
**Memory Issues**\n   ```\n   MemoryError: Unable to allocate array\n   ```\n   Solution: Reduce batch size or use fewer models\n\n### Getting Help\n\n- Check logs in `logs/main.log` for detailed error messages\n- Verify data files are in correct format and location\n- Ensure all dependencies are installed correctly\n\n## 📊 Performance\n\nTypical model performance on validation set:\n\n- **XGBoost**: ~400-450 RMSE (best performer)\n- **Random Forest**: ~420-470 RMSE\n- **Linear Models**: ~450-500 RMSE\n- **Neural Network**: ~430-480 RMSE\n\n## 🔮 Future Enhancements\n\n- [ ] Automated feature selection\n- [ ] Real-time prediction API\n- [ ] Model monitoring dashboard\n- [ ] A/B testing framework\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## 🤝 Contributing\n\nPlease read [CONTRIBUTING.md](CONTRIBUTING.md). By participating, you agree to abide by our [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) and report vulnerabilities per [SECURITY.md](SECURITY.md).\n\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes\n4. Add tests if applicable\n5. Submit a pull request\n\n## 📞 Support\n\nFor questions or issues, please:\n\n1. Check the logs first\n2. Review this documentation\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharshitaphadtare%2Fgopredict","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fharshitaphadtare%2Fgopredict","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharshitaphadtare%2Fgopredict/lists"}