# Kaggle M5 competition: Walmart store forecasting

Top 4% solution for the Kaggle M5 (Accuracy) competition. The competition requires predicting daily unit sales of individual
items in individual Walmart stores over a 28-day horizon.

## Modelling approach

The code is quite short (under 300 lines) and uses only fairly basic features in a LightGBM model. I didn't use any "magic" adjustment
factors, and I didn't use any custom metrics, just RMSE. I think the competition's evaluation metric (WRMSSE) is noisy, especially for
products with short history, because random fluctuations in the day-to-day sales history can cause products to be weighted very
differently even if they have similar long-term averages. So I thought trying to optimise for this metric would lead to
overfitting.
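
As a rough sketch of that setup (the dummy data, feature names, and hyperparameters below are illustrative placeholders, not the repo's actual values):

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Dummy data standing in for the engineered training table
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "dept_id": rng.integers(0, 7, 1000),       # categorical ids
    "store_id": rng.integers(0, 10, 1000),
    "mean_sales_28d": rng.random(1000),        # an example lagged feature
})
y_train = rng.poisson(2, 1000).astype(float)   # daily unit sales

# Plain squared-error objective and plain RMSE metric: no custom or
# weighted metric, per the reasoning above
params = {
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.05,
    "num_leaves": 127,
    "verbosity": -1,
}
train_set = lgb.Dataset(X_train, label=y_train,
                        categorical_feature=["dept_id", "store_id"])
model = lgb.train(params, train_set, num_boost_round=200)
```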

Rather than using a recursive approach, I trained a separate model for each day of the forecasting horizon: for each `n`, I recalculated the features
so that the `n`-day-ahead model is trained on data lagged by `n` days. Based on discussions in the forum (specifically,
this post: https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/144067#),
I decided that the recursive approach was only performing well on the training period by coincidence.
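
A sketch of the idea, assuming M5-style column names and a single illustrative rolling feature (`make_features` and its contents are placeholders, not the repo's actual code); the key point is that every sales-based feature is shifted by an extra `n` days before the `n`-day-ahead model is trained:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
sales = pd.DataFrame({
    "item_id": np.repeat(["A", "B"], 400),
    "store_id": "S1",
    "sales": rng.poisson(2, 800).astype(float),
})

def make_features(df: pd.DataFrame, lag: int) -> pd.DataFrame:
    """Recompute every sales-based feature shifted by `lag` days."""
    out = df.copy()
    grp = out.groupby(["item_id", "store_id"])["sales"]
    out["mean_28d"] = grp.transform(lambda s: s.rolling(28).mean().shift(lag))
    return out

models = {}
for n in range(1, 29):                    # one model per day of the horizon
    feats = make_features(sales, lag=n).dropna(subset=["mean_28d"])
    dset = lgb.Dataset(feats[["mean_28d"]], label=feats["sales"])
    models[n] = lgb.train({"objective": "regression", "metric": "rmse",
                           "verbosity": -1}, dset, num_boost_round=50)
```

Because the features are rebuilt with the extra shift, day-`n` predictions never depend on earlier predictions, so errors cannot compound the way they can in a recursive scheme.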

I noticed that in the test period, there are very few new products (i.e. products that have not been sold
before the test period). So I excluded from the training set rows before the first sale date of a product in a
store, and also excluded these rows when calculating aggregate features.
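
In pandas this filter might look like the following (column names assume the M5 data layout; the dummy data is only for illustration):

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "item_id": np.repeat(["A", "B"], 10),
    "store_id": "S1",
    "day": np.tile(np.arange(10), 2),
    "sales": [0, 0, 0, 1, 2, 0, 3, 1, 0, 2,    # item A first sold on day 3
              1, 0, 2, 0, 0, 1, 0, 0, 4, 1],   # item B first sold on day 0
})

# First day with a nonzero sale, per (item, store)
first_sale = (sales[sales["sales"] > 0]
              .groupby(["item_id", "store_id"])["day"].min()
              .rename("first_sale_day")
              .reset_index())
sales = sales.merge(first_sale, on=["item_id", "store_id"], how="left")

# Keep only rows on or after the first sale; this filtered frame is used
# both as the training set and when computing aggregate features
sales = sales[sales["day"] >= sales["first_sale_day"]]
```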

I used 3 years of data to calculate the features (to reduce noise and capture seasonal trends) and 1 year to actually
train the model. I excluded December from the training period because of the effect of Christmas.
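
A sketch of those windows, with illustrative cutoff dates:

```python
import pandas as pd

train_end = pd.Timestamp("2016-04-24")   # illustrative end of training data
history = pd.DataFrame({"date": pd.date_range("2011-01-29", train_end)})

# ~3 years of history feed the aggregate features (less noise, full seasons)
feature_window = history[history["date"] > train_end - pd.DateOffset(years=3)]

# ...but only the most recent year provides training rows, with December
# dropped because Christmas distorts the sales pattern
train_window = feature_window[
    (feature_window["date"] > train_end - pd.DateOffset(years=1))
    & (feature_window["date"].dt.month != 12)
]
```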

## Features

The feature engineering is mainly common sense: as well as the obvious date features, just lagged sales at various
levels of aggregation. For the aggregated features, I took the mean of sales at 3 levels:
- item and store
- item (aggregated over all stores)
- department and store

The idea is that the higher levels of aggregation provide a less noisy view of item-level and store-level trends.
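
A sketch of how the three levels can be computed with a pandas `groupby`/`transform` (the broadcast-back-to-rows pattern is an assumption about the implementation, not necessarily the repo's exact approach):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sales = pd.DataFrame({
    "item_id": np.repeat(["A", "B"], 6),
    "dept_id": "D1",
    "store_id": np.tile(["S1", "S2", "S3"], 4),
    "sales": rng.poisson(2, 12).astype(float),
})

# Mean sales at each of the three aggregation levels, broadcast back to rows
levels = {
    "mean_item_store": ["item_id", "store_id"],   # item and store
    "mean_item": ["item_id"],                     # item, over all stores
    "mean_dept_store": ["dept_id", "store_id"],   # department and store
}
for name, keys in levels.items():
    sales[name] = sales.groupby(keys)["sales"].transform("mean")
```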

Specifically, the features are:
- dept_id and store_id (categorical)
- day of week, month, SNAP (i.e. whether today is a SNAP day for the current store)
- days since product first sold in that store
- price relative to price 1 week and 2 weeks ago
- item-level holiday adjustment factor (for each holiday and each item, the average change in sales in the week
leading up to the holiday and on the holiday itself)
- long-term mean and variance of sales at the 3 levels of aggregation
- long-term mean and variance of sales at the 3 levels of aggregation for each day of week
- average of last 7, 14, and 28 days of sales at the 3 levels of aggregation
- average sales lagged 1-7 days at the 3 levels of aggregation
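
As an illustration of the rolling-window features at the end of the list, here is a sketch at the item-and-store level; the same pattern would repeat at the other two aggregation levels, and the column names are placeholders:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sales = pd.DataFrame({
    "item_id": "A",
    "store_id": "S1",
    "day": np.arange(60),
    "sales": rng.poisson(2, 60).astype(float),
})

grp = sales.groupby(["item_id", "store_id"])["sales"]

# Trailing averages over the last 7, 14, and 28 days; shift(1) keeps the
# window strictly before the day being predicted
for window in (7, 14, 28):
    sales[f"mean_{window}d"] = grp.transform(
        lambda s, w=window: s.shift(1).rolling(w).mean())

# Single-day lags 1-7
for lag in range(1, 8):
    sales[f"lag_{lag}"] = grp.shift(lag)
```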