https://github.com/lasithaamarasinghe/data_crunch_045
- Host: GitHub
- URL: https://github.com/lasithaamarasinghe/data_crunch_045
- Owner: LasithaAmarasinghe
- Created: 2025-04-02T11:10:46.000Z (28 days ago)
- Default Branch: main
- Last Pushed: 2025-04-03T14:34:40.000Z (27 days ago)
- Last Synced: 2025-04-03T15:37:39.752Z (27 days ago)
- Language: Python
- Size: 6.27 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Crunch - CSE, UoM
# Harveston Climate Prediction 🌾🌦️
## Overview
Harveston's climate is shifting unpredictably, affecting agriculture and food security. This project aims to develop time series forecasting models to predict five critical environmental variables:
- Average Temperature (°C)
- Radiation (W/m²)
- Rain Amount (mm)
- Wind Speed (km/h)
- Wind Direction (°)

## Models Used
### 1. RandomForestRegressor ([Code 1](code%201.py))
- Uses **Random Forest**, an **ensemble learning model** that builds multiple decision trees and averages their predictions.
- Handles feature engineering, categorical encoding, and missing-data preprocessing (see the sketch below).
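A minimal sketch of this setup, assuming a hypothetical `train.csv` with `Year`/`Month`/`Day`/`kingdom` feature columns and assumed target-column names (the real pipeline is in [Code 1](code%201.py)):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv("train.csv")  # hypothetical file name

# Assumed column names for the five forecast targets.
targets = ["Avg_Temperature", "Radiation", "Rain_Amount", "Wind_Speed", "Wind_Direction"]
features = ["Year", "Month", "Day", "kingdom"]

# Categorical encoding plus simple median imputation for missing values.
train["kingdom"] = train["kingdom"].astype("category").cat.codes
X = train[features].fillna(train[features].median())

# One forest per target: each builds many decision trees and averages their outputs.
models = {}
for target in targets:
    y = train[target].fillna(train[target].median())
    models[target] = RandomForestRegressor(
        n_estimators=300, random_state=42, n_jobs=-1
    ).fit(X, y)
```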
### 2. LightGBM ([Code 2](code%202.py))
- Implements **LightGBM**, an optimized **gradient boosting** ensemble that builds trees sequentially to improve predictions.
- Uses histogram-based, leaf-wise boosting for efficient learning and high performance.
- Performs hyperparameter tuning and feature extraction (sketched below).
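An illustrative LightGBM configuration; the hyperparameter values below are assumptions, not the tuned settings from [Code 2](code%202.py). `X` and `y` are prepared as in the Random Forest sketch:

```python
import lightgbm as lgb

model = lgb.LGBMRegressor(
    n_estimators=1000,
    learning_rate=0.05,     # boosting: each tree corrects the current ensemble
    num_leaves=63,          # leaf-wise growth controls model capacity
    subsample=0.8,          # row subsampling per iteration
    colsample_bytree=0.8,   # feature subsampling per tree
    random_state=42,
)
model.fit(X, y)
```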
### 3. XGBoost ([Code 3](code%203.py))
- Uses **XGBoost**, another powerful **ensemble gradient boosting model** known for efficiency and regularization.
- Applies gradient boosting with optimized tree construction (e.g., sparsity-aware splits and parallelized training).
- Handles categorical encoding, missing values, and feature engineering (see the sketch below).
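An equivalent XGBoost sketch; the parameter values are illustrative assumptions:

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=800,
    learning_rate=0.05,
    max_depth=6,
    reg_lambda=1.0,   # L2 regularization on leaf weights
    subsample=0.8,
    random_state=42,
)
model.fit(X, y)
```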
### 4. LSTM ([Code 4](code%204.py))
- Uses an LSTM (Long Short-Term Memory) model, a type of **recurrent neural network** (RNN) designed for sequential data processing.
- Employs memory cells to capture long-term dependencies and temporal patterns in data.
- Trains the model with early stopping, batch processing, and validation to optimize performance (sketched below).
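A minimal Keras sketch of the idea; the 30-day window, layer sizes, and the stand-in random series are assumptions rather than the actual architecture in [Code 4](code%204.py):

```python
import numpy as np
from tensorflow import keras

WINDOW = 30  # assumed number of past days fed to the network

def make_windows(series, window=WINDOW):
    """Turn a 1-D series into (samples, window, 1) inputs and next-step targets."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    return X[..., None], series[window:]

model = keras.Sequential([
    keras.layers.Input(shape=(WINDOW, 1)),
    keras.layers.LSTM(64),   # memory cells capture long-range temporal patterns
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mae")

X_seq, y_seq = make_windows(np.random.rand(1000).astype("float32"))  # stand-in series
model.fit(
    X_seq, y_seq,
    validation_split=0.2,   # held-out slice for validation
    batch_size=32,
    epochs=100,
    callbacks=[keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)],
)
```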
### 5. Gradient Boosting ([Code 5](code%205.py))
- Uses Gradient Boosting, an **ensemble learning technique** that builds trees sequentially to minimize prediction errors.
- Fits each new tree to the residual errors (negative gradients) of the current ensemble, iteratively improving accuracy.
- Optimizes tree depth, learning rate, and subsampling for enhanced performance and generalization (see the sketch below).
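An illustrative scikit-learn `GradientBoostingRegressor` with the knobs the list above mentions (values are assumptions):

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.05,   # shrinks each tree's contribution
    max_depth=4,          # shallow trees act as weak learners
    subsample=0.8,        # stochastic boosting for better generalization
    random_state=42,
)
model.fit(X, y)
```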
### 6. BaggingRegressor ([Code 6](code%206.py))
- Implements Bagging, an ensemble learning method that trains multiple base regressors (Decision Trees) on random subsets of the data and averages their predictions.
- Enhances model stability, reduces variance, and improves generalization by averaging many high-variance base learners.
- Uses bootstrap sampling and feature selection to improve robustness and mitigate overfitting (sketched below).
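A minimal `BaggingRegressor` sketch; the estimator count and feature fraction are assumptions (`estimator=` requires scikit-learn ≥ 1.2; older versions use `base_estimator=`):

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

model = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # base learner
    n_estimators=200,
    bootstrap=True,      # sample training rows with replacement
    max_features=0.8,    # random feature subset per base regressor
    random_state=42,
)
model.fit(X, y)
```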
### 7. Stacking ([Code 7](code%207.py))
- Implements a stacked ensemble model that combines multiple base regressors to enhance prediction accuracy.
- Uses diverse base models, including Random Forest, Gradient Boosting, XGBoost, Ridge Regression, and Lasso Regression, to capture different aspects of the data.
- Employs k-fold cross-validation to generate out-of-fold predictions for training a meta-model (XGBoost) that learns from the base models' outputs (see the sketch below).
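One way to express this with scikit-learn's `StackingRegressor`, whose `cv` parameter produces the out-of-fold predictions used to train the meta-model; the base-model settings are assumptions:

```python
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Lasso, Ridge
from xgboost import XGBRegressor

model = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=42)),
        ("gb", GradientBoostingRegressor(random_state=42)),
        ("xgb", XGBRegressor(n_estimators=300, random_state=42)),
        ("ridge", Ridge()),
        ("lasso", Lasso()),
    ],
    final_estimator=XGBRegressor(n_estimators=100, random_state=42),  # meta-model
    cv=5,  # k-fold out-of-fold predictions feed the meta-model
)
model.fit(X, y)
```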
## Dataset
- The dataset contains historical environmental records from different kingdoms in Harveston.
- The test dataset includes `ID`, `Year`, `Month`, `Day`, and `kingdom`, requiring predictions for the five target variables.
## Evaluation Metric
Predictions are evaluated using the **Symmetric Mean Absolute Percentage Error (sMAPE)**:

$$\text{sMAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{|y_{\text{true},i} - y_{\text{pred},i}|}{\left(|y_{\text{true},i}| + |y_{\text{pred},i}|\right)/2}$$
The final score is the average sMAPE across all target columns.
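A direct NumPy translation of the formula (illustrative; note that the metric is undefined when both the true and predicted values are zero):

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent, per the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return 100.0 * np.mean(np.abs(y_true - y_pred) / denom)

# Final score: average sMAPE over the five target columns, e.g.
# np.mean([smape(actual[c], predicted[c]) for c in targets])
```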