https://github.com/nadirrezaou/demand-forecasting-with-random-forest

Forecasting product demand using Random Forest and sales data preprocessing.

data-science demand-forecasting machine-learning python random-forest regression sales-prediction sklearn

# Demand Forecasting with Random Forest

Demand forecasting is the process of estimating **future customer demand** over a specific period by analyzing historical sales data and related features.

Traditionally, organizations use statistical forecasting methods such as **ARIMA, SARIMA, and Moving Averages**.
However, these methods often require significant domain expertise and manual tuning.

With the rise of **Machine Learning**, new approaches have emerged that learn patterns directly from data and can, in many cases, produce more accurate forecasts.

---

## Table of Contents

- [Goal](#goal)
- [Data](#data)
- [Workflow](#workflow)
- [Result](#result)
- [Required Packages](#required-packages)

---

## Goal

The goal of this project is to explore the use of **machine learning models**, specifically the **Random Forest Regressor**, for predicting product demand.

Unlike traditional approaches, machine learning models can:
- Handle **large datasets**
- Capture **complex relationships**
- Use **categorical features** with little manual effort once they are encoded (one-hot encoding in this project)

By building and tuning a Random Forest model, we aim to improve the accuracy of demand forecasts and reduce prediction errors.

---

## Data

The dataset consists of daily sales records with the following fields:

- `record_ID` – Unique record identifier
- `week` – Date of the sales record (split into `day`, `month`, and `year` during preprocessing)
- `store_id` – Store identifier
- `sku_id` – Product identifier
- `total_price` – Final price after discounts
- `base_price` – Original price
- `is_featured_sku` – Whether the product was featured
- `is_display_sku` – Whether the product was displayed
- `units_sold` – **Target variable** (number of units sold)

---

## Workflow

### Data Preprocessing
- Split `week` column into `day`, `month`, `year`
- Handle missing values
- Remove outliers (rows in the top 1% of `units_sold`)
- Drop irrelevant features (`record_ID`)
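
The preprocessing steps above can be sketched with pandas as follows. This is a minimal illustration, not the repository's actual code: the synthetic data, the ISO date format, the choice to drop rows with missing values, and the names `raw`, `clean`, and `cutoff` are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real sales file (all values are made up).
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "record_ID": np.arange(1, n + 1),
    "week": pd.date_range("2011-01-17", periods=n, freq="7D").strftime("%Y-%m-%d"),
    "store_id": rng.choice([8091, 8095], size=n),
    "sku_id": rng.choice([216418, 216419], size=n),
    "total_price": rng.uniform(80, 120, size=n),
    "units_sold": rng.poisson(20, size=n),
})
raw.loc[3, "total_price"] = np.nan  # inject one missing value

clean = raw.copy()

# Split the date column into day, month, and year.
clean["week"] = pd.to_datetime(clean["week"])
clean["day"] = clean["week"].dt.day
clean["month"] = clean["week"].dt.month
clean["year"] = clean["week"].dt.year
clean = clean.drop(columns=["week"])

# Handle missing values (assumption: simply drop incomplete rows).
clean = clean.dropna()

# Remove outliers: keep rows at or below the 99th percentile of units_sold.
cutoff = clean["units_sold"].quantile(0.99)
clean = clean[clean["units_sold"] <= cutoff]

# Drop the identifier column, which carries no predictive signal.
clean = clean.drop(columns=["record_ID"])
```

Imputing missing values instead of dropping rows is an equally valid choice; which one fits depends on how much data is missing.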

### Feature Engineering
- One-hot encode categorical variables (`store_id`, `sku_id`)
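
One way to one-hot encode the two identifier columns is `pandas.get_dummies`, sketched below. The store and SKU values are invented for illustration, and the repository may use a different encoder (for example scikit-learn's `OneHotEncoder`).

```python
import pandas as pd

# Small made-up sample with the two categorical identifier columns.
df = pd.DataFrame({
    "store_id": [8091, 8091, 8095, 8095],
    "sku_id": [216418, 216419, 216418, 216419],
    "total_price": [99.0, 99.0, 133.9, 133.9],
    "units_sold": [20, 28, 19, 44],
})

# Replace each identifier column with one indicator column per distinct value.
encoded = pd.get_dummies(df, columns=["store_id", "sku_id"])
print(encoded.columns.tolist())
```

Each original identifier column is replaced by indicator columns named `<column>_<value>`, so the numeric IDs are no longer treated as ordered quantities.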

### Regression Modeling
- Split dataset into training and testing sets
- Train a **Random Forest Regressor**
- Evaluate performance with **R² score** and **RMSE**
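
The modeling steps above could look roughly like this with scikit-learn. The data is synthetic, and the 80/20 split, `n_estimators=100`, and the random seeds are assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features loosely mimicking the dataset (values are made up).
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.uniform(50, 250, n),   # total_price
    rng.integers(0, 2, n),     # is_featured_sku
    rng.integers(0, 2, n),     # is_display_sku
])
# Synthetic target: demand falls with price and rises with promotion flags.
y = 100 - 0.3 * X[:, 0] + 15 * X[:, 1] + 10 * X[:, 2] + rng.normal(0, 3, n)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the forest and predict on the held-out rows.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# R² measures explained variance; RMSE is in the same units as the target.
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R2: {r2:.3f}  RMSE: {rmse:.3f}")
```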

### Hyperparameter Tuning
- Use **GridSearchCV** to optimize parameters:
  - `n_estimators` (number of trees in the forest)
  - `min_samples_split` (minimum number of samples a node must contain before it can be split)
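
A grid search over these two parameters might be set up as in the sketch below. The candidate values, the 3-fold cross-validation, the RMSE-based scoring, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (made up) so the example runs on its own.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(0, 0.5, 200)

# Candidate values for the two parameters named in the README.
param_grid = {
    "n_estimators": [50, 100],
    "min_samples_split": [2, 5, 10],
}

# scikit-learn maximizes scores, so RMSE is exposed as a negated metric.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated RMSE:", -search.best_score_)
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, so `search.best_estimator_` can be used for prediction directly.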

### Visualization
- Plot **predicted vs actual sales**
- Explore feature distributions and sales patterns
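
A predicted-vs-actual plot can be drawn with matplotlib as sketched here. The arrays `y_test` and `y_pred` are random placeholders standing in for real test targets and model predictions, and the labels and styling are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: "actual" sales and noisy "predictions" (made up).
rng = np.random.default_rng(1)
y_test = rng.uniform(0, 100, 150)
y_pred = y_test + rng.normal(0, 8, 150)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_pred, alpha=0.5, label="Test samples")

# Diagonal reference line: points on it are predicted exactly.
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax.plot(lims, lims, color="red", linestyle="--", label="Perfect prediction")

ax.set_xlabel("Actual units sold")
ax.set_ylabel("Predicted units sold")
ax.set_title("Predicted vs actual sales")
ax.legend()
fig.tight_layout()
# Call plt.show() or fig.savefig("predicted_vs_actual.png") to view the figure.
```

Points far from the diagonal mark the records where the model over- or under-predicts the most.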

---

## Result

- The model predicts demand with a reasonable **R² score** and a lower **RMSE** than the baseline.
- After **hyperparameter tuning**, accuracy improves further.
- Future improvements may include advanced models such as **XGBoost** or **Neural Networks**.

---

## Required Packages

```txt
numpy
pandas
scikit-learn
matplotlib
```