{"id":20337968,"url":"https://github.com/UznetDev/Wild-Blueberry-Prediction","last_synced_at":"2025-05-08T02:31:41.913Z","repository":{"id":262044585,"uuid":"885790002","full_name":"UznetDev/Wild-Blueberry-Prediction","owner":"UznetDev","description":"This project focuses on building a regression model to predict crop yield ('yield') using a dataset with various agricultural metrics. We employ extensive data analysis, feature engineering, and model tuning to minimize the model's error.","archived":false,"fork":false,"pushed_at":"2024-11-24T15:06:18.000Z","size":21416,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-31T06:46:25.993Z","etag":null,"topics":["ai","ana","classic-model","machine-learning","prediction","regression","regression-models","yield"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UznetDev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-09T12:04:39.000Z","updated_at":"2024-11-24T15:06:21.000Z","dependencies_parsed_at":"2024-11-24T16:29:29.131Z","dependency_job_id":null,"html_url":"https://github.com/UznetDev/Wild-Blueberry-Prediction","commit_stats":null,"previous_names":["uznetdev/m5-h2-regression-competition","uznetdev/wild-blueberry-prediction"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UznetDev%2FWild-Blueberry-Prediction","tags_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/UznetDev%2FWild-Blueberry-Prediction/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UznetDev%2FWild-Blueberry-Prediction/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UznetDev%2FWild-Blueberry-Prediction/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UznetDev","download_url":"https://codeload.github.com/UznetDev/Wild-Blueberry-Prediction/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252986827,"owners_count":21836234,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ana","classic-model","machine-learning","prediction","regression","regression-models","yield"],"created_at":"2024-11-14T21:11:04.278Z","updated_at":"2025-05-08T02:31:41.896Z","avatar_url":"https://github.com/UznetDev.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# M5-H2-Regression-Competition\n\n## Project Overview\n\nThis project focuses on building a regression model to predict crop yield ('yield') using a dataset with various agricultural metrics. We employ extensive data analysis, feature engineering, and model tuning to minimize the model's error. 
The final model is stored as a serialized object for easy reuse in production.\n\n## Key Features\n\n- **Custom Preprocessing Pipelines:** Includes transformers for feature selection, feature engineering, and outlier handling.\n- **Extensive Feature Engineering:** New features are crafted based on domain knowledge to improve model performance.\n- **Pipeline Integration:** A single unified pipeline to streamline preprocessing, feature engineering, and model training.\n- **Hyperparameter Tuning:** Hyperparameters for the RandomForestRegressor were optimized with Optuna.\n\n## Repository Structure\n\n```\nWild-Blueberry-Prediction/\n├── notebooks/\n│   ├── EDA.ipynb                # Exploratory Data Analysis\n│   ├── Model.ipynb              # Model training and evaluation\n│   └── model_explain.ipynb      # Model explainability and interpretation\n├── data/\n│   ├── train.csv                # Training dataset\n│   └── test.csv                 # Test dataset\n├── README.md                    # Project documentation\n├── LICENSE                      # Project license (MIT)\n├── requirements.txt             # Required Python packages\n└── model.pkl                    # Final model\n```\n\n## Installation\n\n1. Clone the repository:\n\n   ```bash\n   git clone https://github.com/UznetDev/Wild-Blueberry-Prediction.git\n   ```\n\n2. Navigate to the project directory:\n\n   ```bash\n   cd Wild-Blueberry-Prediction\n   ```\n\n3. 
Install the dependencies:\n\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n## Usage\n\n### Loading and Using the Model\n\nTo use the model stored in `model.pkl`, follow these steps:\n\n```python\nimport dill as pickle\nimport pandas as pd\n\n# Load the trained model\nwith open('model.pkl', 'rb') as f:\n    model = pickle.load(f)\n\n# Prepare your test data (ensure it matches the training data format)\n# test_data must contain the columns ['seeds', 'fruitmass', 'fruitset', 'AverageOfUpperTRange']\ntest_data = pd.read_csv('data/test.csv')\n\n# Make predictions\npredictions = model.predict(test_data)\nprint(predictions)\n```\n\n## Model Details\n\nThe model pipeline integrates several custom transformers and a tuned RandomForestRegressor:\n\n```python\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.pipeline import Pipeline\n\n# Custom transformers (bodies omitted here)\nclass ColumnSelector(BaseEstimator, TransformerMixin):\n    ...\nclass FeatureEngineer(BaseEstimator, TransformerMixin):\n    ...\nclass OutlierReplacer(BaseEstimator, TransformerMixin):\n    ...\n\n# Steps run in order: select columns, replace outliers, engineer features, fit the model\nmodel = Pipeline([\n    ('column_selector', ColumnSelector(columns=['seeds', 'fruitmass', 'fruitset', 'AverageOfUpperTRange'])),\n    ('outlier_replacer', OutlierReplacer()),\n    ('feature_engineer', FeatureEngineer()),\n    ('model', RandomForestRegressor(...))\n])\n```\n\n## Dataset\n\nThe dataset is loaded from `data/train.csv` and includes features related to crop characteristics and environmental conditions. Detailed exploratory data analysis (EDA) is documented in `notebooks/EDA.ipynb`.\n\n## Feature Engineering\n\nThe project includes custom feature engineering steps, such as creating ratios and interactions between features (e.g., `FruitToSeedRatio`, `fruitset_seeds`). These are implemented within the `FeatureEngineer` class.\n\n## Models Used\n\nThe primary model used is a **RandomForestRegressor** with custom hyperparameters. 
The pipeline approach allows easy modification and extension of the model, so individual steps can be swapped or adapted for other datasets.\n\n## Hyperparameter Tuning\n\nThe tuned RandomForestRegressor uses the following hyperparameters:\n\n- `max_depth=9`\n- `n_estimators=497`\n- `max_features=0.809`\n- `min_samples_split=10`\n- `min_samples_leaf=4`\n- `criterion='absolute_error'`\n\nThese values were chosen to maximize model performance while limiting overfitting.\n\n## Evaluation\n\nThe model was evaluated using standard regression metrics, including RMSE, R^2 and MAE. Details on evaluation and insights are in `notebooks/Model.ipynb`.\n\n### Model Explanation\n\nTo understand this model, we use two model explainers: **SHAP** and **Permutation Importance**.\n\n1. **Permutation Importance**:\n   - **Purpose**: Permutation importance measures the impact of each feature on the model’s predictive performance. It works by shuffling the values of each feature in turn and observing how much the model’s score degrades. This technique helps identify which features are most crucial to the overall performance of the model.\n   - **Usage**: We calculate the importance of each feature using `permutation_importance` from `sklearn.inspection`.\n   - **Plot**: The permutation importance plot ranks features by their influence on the model’s score, making it easy to see which features are essential for the model’s performance.\n\n2. **SHAP (SHapley Additive exPlanations)**:\n   - **Purpose**: SHAP values explain individual predictions by showing the impact of each feature on the model’s output. 
This method highlights how each feature contributes to specific predictions.\n   - **Usage**: We use `shap.TreeExplainer` to analyze our model, showing the effect each feature has on the model output.\n   - **Plot**: The SHAP summary plot provides a bar chart, showing the average importance of each feature across all predictions, offering insights into which features are most influential.\n\nThe full explainability analysis is in `notebooks/model_explain.ipynb`.\n\n## Usage\n\nThe model is pre-trained and saved as `model.pkl`. Load and run it directly to make predictions on new data without retraining.\n\n## Results\n\nThe final model achieved strong results on the provided dataset, making it suitable for practical yield predictions in agricultural applications.\n\n## Contributing\n\nContributions are welcome! If you'd like to improve this project, please fork the repository and open a pull request.\n\n1. Fork the repository.\n2. Create a new branch for your feature or bug fix:\n   ```bash\n   git checkout -b feature-name\n   ```\n3. Commit your changes:\n   ```bash\n   git commit -m \"Add a new feature\"\n   ```\n4. Push to your branch:\n   ```bash\n   git push origin feature-name\n   ```\n5. Open a pull request.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Contact\n\nIf you have any questions or suggestions, please contact:\n- Email: uznetdev@gmail.com\n- GitHub Issues: [Issues section](https://github.com/UznetDev/Wild-Blueberry-Prediction/issues)\n- GitHub Profile: [UznetDev](https://github.com/UznetDev/)\n- Telegram: [UZNet_Dev](https://t.me/UZNet_Dev)\n- LinkedIn: [Abdurakhmon Niyozaliev](https://www.linkedin.com/in/abdurakhmon-niyozaliyev-%F0%9F%87%B5%F0%9F%87%B8-66545222a/)\n\n---\n\nThank you for your interest in this project. 
We hope it helps in your journey to understand and predict crop yield using data science!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FUznetDev%2FWild-Blueberry-Prediction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FUznetDev%2FWild-Blueberry-Prediction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FUznetDev%2FWild-Blueberry-Prediction/lists"}