https://github.com/vardhin/predictive-casestudy

Last synced: about 1 month ago
JSON representation

Host: GitHub
URL: https://github.com/vardhin/predictive-casestudy
Owner: vardhin
Created: 2026-04-16T12:49:35.000Z (3 months ago)
Default Branch: master
Last Pushed: 2026-04-16T13:06:12.000Z (3 months ago)
Last Synced: 2026-04-16T15:09:22.471Z (3 months ago)
Language: Python
Size: 8.91 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # Case Study 16 — Rossmann Store Sales Forecasting

**22BDS0114 — GSVARDHIN**

End-to-end retail-sales forecasting on the **Rossmann Store Sales** dataset

(Kaggle): EDA, decomposition, stationarity testing, five forecasting models,

walk-forward evaluation, and an interactive Streamlit UI.

---

## Why this case study

The original AirPassengers (144 monthly points) is a textbook toy. Real retail

forecasting is *messier* — multiple stores, daily granularity, promos,

holidays, store-type heterogeneity, and zero-sales days when stores close.

Rossmann gives us all of that on **1,115 stores × 942 daily observations**

(~1M rows), so we can compare classical and ML models on a realistic problem.

---

## Dataset

| Detail | Value |

|---|---|

| Source | [Rossmann Store Sales — Kaggle](https://www.kaggle.com/c/rossmann-store-sales) |

| Files used | `train.csv` (sales), `store.csv` (store metadata) |

| Stores | 1,115 |

| Daily observations | 1,017,209 rows |

| Period | 2013-01-01 → 2015-07-31 |

| Target | `Sales` (€) |

| Exogenous | `Promo`, `SchoolHoliday`, `StateHoliday`, `Promo2`, `CompetitionDistance`, `StoreType`, `Assortment` |

**The CSVs are not committed.** Download from Kaggle and place `train.csv`,

`store.csv`, (optionally `test.csv`) in the project root.

---

## Architecture

```

predictive-casestudy/

├── main.py              # CLI: runs full headless pipeline, saves PNG/CSV to outputs/

├── app.py               # Streamlit multi-page UI

├── src/

│   ├── data.py          # Load, merge, feature engineering (cached as parquet)

│   ├── eda.py           # Plotly figures + ADF stationarity test

│   └── models.py        # Naive, Holt-Winters, SARIMA, Prophet, XGBoost + walk-forward CV

├── pyproject.toml

├── train.csv  store.csv  test.csv      # (gitignored — fetched from Kaggle)

├── cache/                              # parquet cache of engineered data

└── outputs/                            # generated charts + leaderboard CSVs

```

---

## Models compared

| Model | Type | Notes |

|---|---|---|

| **Naive (seasonal-7)** | Baseline | `y_t = y_{t-7}` — must be beaten |

| **Holt-Winters** | Classical ETS | Triple exponential smoothing, weekly seasonality |

| **SARIMA(1,1,1)(1,1,1,7)** | Classical | Seasonal ARIMA on daily data |

| **Prophet** | Decomposable | Trend + weekly + yearly + Promo/School regressors |

| **XGBoost** | ML | Lag (1/7/14/28) + rolling stats + calendar/promo features, recursive forecast |

All models share a common `fit(train) → predict(future_index)` interface in

[src/models.py](src/models.py), so adding a new model is one class.

### Metrics

MAE · RMSE · MAPE · **RMSPE** (Rossmann competition's official metric).

Walk-forward evaluation across `n` folds is available via

`models.walk_forward_eval()`.

---

## How to run

```bash

# Install deps (uv handles venv automatically)

uv sync

# Headless analysis on the network-wide aggregate (writes to ./outputs/)

uv run main.py

# Or analyse a single store:

uv run main.py 1 28        # store_id=1, horizon=28 days

# Interactive Streamlit UI

uv run streamlit run app.py

```

The Streamlit app has six pages — Overview, EDA, Decomposition & Stationarity,

Forecasting, Model Comparison, Future Forecast — and a sidebar to switch

between **Network-wide** and **Single-store** scope.

---

## Limitations

- Holt-Winters & SARIMA assume regular sampling and can be slow on long daily

  series — we cap CV folds and horizons accordingly.

- XGBoost recursive forecasting compounds errors over long horizons; for >30 days

  it loses to Holt-Winters on the network-wide series.

- Exogenous regressors (`Promo`, `SchoolHoliday`) need to be known in advance for

  out-of-sample prediction; we substitute day-of-week medians for the future.

- We do not model store closures during the Rossmann refurbishment in 2014; the

  raw zeros are filtered out by `only_open=True` in the per-store path.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/vardhin/predictive-casestudy

Awesome Lists containing this project

README