{"id":20439446,"url":"https://github.com/atharvapathak/sales_forecasting_project","last_synced_at":"2026-04-12T14:46:04.644Z","repository":{"id":196208043,"uuid":"694967859","full_name":"atharvapathak/Sales_Forecasting_Project","owner":"atharvapathak","description":"Forecasted product sales using time series models such as Holt-Winters, SARIMA and causal methods, e.g. Regression. Evaluated performance of models using forecasting metrics such as, MAE, RMSE, MAPE and concluded that Linear Regression model produced the best MAPE in comparison to other models","archived":false,"fork":false,"pushed_at":"2024-04-10T09:39:06.000Z","size":340,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-15T20:19:59.778Z","etag":null,"topics":["datamining","demand-forecasting","feature-engineering","machine-learning","machinelearning","python","regression-trees","retail","sales","sales-forecasting","seaborn","sklearn","statsmodels","time-series-analysis","time-series-decomposition"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/atharvapathak.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-22T04:24:29.000Z","updated_at":"2024-06-26T15:41:18.000Z","dependencies_parsed_at":"2025-01-15T19:22:47.838Z","dependency_job_id":"b34fa8b1-93e1-4dd5-9016-940be098a32b","html_url":"https://github.com/atharvapathak/Sales_Forecasting_Project","commit_stats":null,"previous_names":["atharvapathak/sales_forecasting_project"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atharvapathak%2FSales_Forecasting_Project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atharvapathak%2FSales_Forecasting_Project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atharvapathak%2FSales_Forecasting_Project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atharvapathak%2FSales_Forecasting_Project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/atharvapathak","download_url":"https://codeload.github.com/atharvapathak/Sales_Forecasting_Project/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241988369,"owners_count":20053656,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datamining","demand-forecasting","feature-engineering","machine-learning","machinelearning","python","regression-trees","retail","sales","sales-forecasting","seaborn","sklearn","statsmodels","time-series-analysis","time-series-decomposition"],"created_at":"2024-11-15T09:17:26.290Z","updated_at":"2025-12-31T01:00:39.738Z","avatar_url":"https://github.com/atharvapathak.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sales_Forecasting_Project\n\n[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2Fatharvapathak%2Fhit-counter\u0026count_bg=%2379C83D\u0026title_bg=%23555555\u0026icon=\u0026icon_color=%23E7E7E7\u0026title=hits\u0026edge_flat=false)](https://hits.seeyoufarm.com)\n## Sales_Forecasting_Project\n\n[![Generic badge](https://img.shields.io/badge/Datascience-Beginners-Red.svg?style=for-the-badge)](https://github.com/atharvapathak) \n[![Generic badge](https://img.shields.io/badge/LinkedIn-Connect-blue.svg?style=for-the-badge\u0026logo=linkedin\u0026logoColor=white)](https://www.linkedin.com/in/atharva-pathak-126021119/) \n[![Generic badge](https://img.shields.io/badge/Python-Language-blue.svg?style=for-the-badge)](https://github.com/atharvapathak/Sales_Forecasting_Project)\n\n#### The goal of this project is to Predict the Future Sales [#DataScience](https://github.com/atharvapathak/Sales_Forecasting_Project) for the challenging time-series dataset consisting of daily sales data,\n\n[![GitHub repo size](https://img.shields.io/github/repo-size/atharvapathak/Sales_Forecasting_Project.svg?logo=github\u0026style=social)](https://github.com/atharvapathak) [![GitHub code size in 
bytes](https://img.shields.io/github/languages/code-size/atharvapathak/Sales_Forecasting_Project.svg?logo=git\u0026style=social)](https://github.com/atharvapathak/)[![GitHub top language](https://img.shields.io/github/languages/top/atharvapathak/Sales_Forecasting_Project.svg?logo=python\u0026style=social)](https://github.com/atharvapathak)\n\n#### A few popular hashtags - \n### `#Sales Prediction` `#Time Series` `#Ensembling`\n### `#XGBoost` `#Parameter Tuning` `#LightGBM`\n\n### Motivation\nIn this competition I was working with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company. \nThe task is to predict total sales for every product and store in the next month. By solving this competition I was able to apply and enhance my data science skills.\n\nThis documentation contains general information about my approach and technical information about Kaggle’s Predict Future Sales competition.\n\n### Steps involved in this project\n### Kaggle Predicting Future Sales - Playground Prediction Competition\n\n### Kaggle Competition: [Predict Future Sales](https://www.kaggle.com/c/competitive-data-science-predict-future-sales)\n### Data Description\n\nYou are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.\n\n**File descriptions**\n```\n- sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.\n- test.csv - the test set. 
You need to forecast the sales for these shops and products for November 2015.\n- sample_submission.csv - a sample submission file in the correct format.\n- items.csv - supplemental information about the items/products.\n- item_categories.csv - supplemental information about the item categories.\n- shops.csv - supplemental information about the shops.\n```\n\n**Data fields**\n```\n- ID - an Id that represents a (Shop, Item) tuple within the test set\n- shop_id - unique identifier of a shop\n- item_id - unique identifier of a product\n- item_category_id - unique identifier of item category\n- item_cnt_day - number of products sold. You are predicting a monthly amount of this measure\n- item_price - current price of an item\n- date - date in format dd/mm/yyyy\n- date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33\n- item_name - name of item\n- shop_name - name of shop\n- item_category_name - name of item category\n```\n## I. Summary\n- The main method I used for this competition, which produced the desired leaderboard score: LightGBM\n- Methods I tried that resulted in a worse RMSE: XGBoost, Stacking (both simple averaging and meta-models such as Linear Regression and a shallow random forest)\n- The most important features are lag features of previous months, especially the ‘item_cnt_day’ lag features. Some of them, which can be found in my lag dataset, are \n  - **target_lag_1,target_lag_2**: item_cnt_day of each shop-item pair for the previous month and the month before that\n  - **item_block_target_mean_lag_1, item_block_target_sum_lag_1**: mean and sum of item_cnt_day per item for the previous month\nFeature importance is measured with the LightGBM model\n- Tools I used in this competition are: numpy, pandas, sklearn, XGBoost GPU, LightGBM (running Pytorch)\n- All models are tuned on a Linux server with an Intel i5 processor, 16GB RAM and an NVIDIA 1080 GPU. 
Tuning models took about 8 to 10 hours, and training on the whole dataset took \u003c=5 minutes\n\n\n## II. Exploratory Data Analysis\nMore information can be found in the [EDA notebook](EDA.ipynb)\n\nBasic data analysis is done, including plotting the sum and mean of item_cnt_day for each month to find patterns, exploring missing values, inspecting the test set …\n\nHere are a few interesting things I found from doing EDA:\n- The number of sold items declines over the years\n- There are peaks in November and similar zigzag behavior of item counts in June-July-August. This inspired me to look up Russian national holidays and create Boolean holiday features. More information can be found in the ‘Feature Engineering’ section\n- The data has no missing values\n- Some interesting information from the test set analysis:\n  - Not every shop_id in the training set is used in the test set. The test set excludes the following shops (but not vice versa): [0, 1, 8, 9, 11, 13, 17, 20, 23, 27, 29, 30, 32, 33, 40, 43, 51, 54]\n  - Not all items in the train set are in the test set, and vice versa\n  - In the test set, a fixed set of items (5100) is used for each shop_id, and each item appears only once per shop. This possibly means that items are picked from a generator, which will result in lots of 0 values for item count. Therefore, generating all possible shop-item pairs for each month in the train set and filling missing item counts with 0 makes sense.\n\n\n## III. Feature Engineering\n\n### 1. Generate all shop-item pairs and Mean Encoding\nSince the competition task is to make a monthly prediction, we need to aggregate the data to the monthly level before doing any encoding\n\nI generated item counts for each shop-item pair per month (‘target’). 
I also generated the sum and mean of item counts for each shop per month (‘shop_block_target_sum’, ‘shop_block_target_mean’), each item per month (‘item_block_target_sum’, ‘item_block_target_mean’), and each item category per month (‘item_cat_block_target_sum’, ‘item_cat_block_target_mean’)\n\nThis process can be found in [this notebook](generate_lag_features.ipynb), under ‘Generating new_sales.csv’. The dataset generated in this step is saved under the name ‘new_sales.csv’\n\n### 2. Generate lag features\nLag features are values at prior time steps. I generate lag features based on ‘item_cnt’, grouped by ‘shop_id’ and ‘item_id’. Time steps are: 1, 2, 3, 5 and 12 months.\n\nAll sales records before 2014 are dropped, since the 12-month lag cannot be computed for them.\n\nThese lag features turn out to be the most important features in my dataset, based on gradient boosting’s feature importances.\n\nMore information can be found in [this notebook](generate_lag_features.ipynb), under ‘Generate lag feature new_sales_lag_after12.pickle’\n\n### 3. Holiday Boolean features\nAs mentioned above, I looked up a few Russian national holidays and created 5 more features: December (to mark December), Newyear_Xmas (for January), Valentine_Menday (February), Women_Day (March), Easter_Labor (April). This might help boost my score a little, since the December feature seems to be helpful\n\nAfter all these steps, you should have a pickle file in the ‘data’ directory: 'new_sales_lag_after12.pickle'. This is the main file I used for training models\n\n\n## IV. Cross validation\nSince this is a time series, I have to pre-define which data can be used for training and testing. I have a function called get_cv_idxs in utils.py that returns a list of tuples for cross validation. 
I decided to use 6 folds, from date_block_num 28 to 33, and luckily this CV score is consistent with the leaderboard score.\n\nCV indices can be retrieved from this custom function:\n\n```\ncv = get_cv_idxs(dataframe, 28, 33)\n# dataframe must contain a date_block_num column\n```\n\nResults from this function can be passed to sklearn GridSearchCV.\n\n## V. Training methods\n\n### 1. LightGBM\nLightGBM is tuned using hyperopt, then manually tuned with GridSearchCV to get the optimal result. One interesting thing I found: when tuning the size of the tree, it’s better to tune min_data_in_leaf instead of max_depth. This lets the tree grow freely until the condition for min_data_in_leaf is met. I believe this allows deeper logic to develop without overfitting too much. Colsample_bytree and subsample are also used to control overfitting, and I keep the learning rate small (0.03) throughout tuning.\n\nThe mean RMSE of the 6-fold CV is 0.8088, which is better than any other model I used.\n\nYou can find more information in the [LGB notebook](lightgbm_tuning.ipynb). From this file I also created out-of-fold features for blocks 29 to 33, which are used for ensembling later.\n\nAlso from this notebook, you can get the leaderboard submission under the file name: ‘coursera_tuned_lightgbm_basic_6folds.csv’\n\n(Note: I do not include some of the hyperparameter tuning results from hyperopt since I tuned it at work and I do not have access to that machine now)\n\n\n### 2. XGBoost\nI ran the GPU version of XGBoost and followed the same tuning procedure as described for LightGBM. For some reason, I can’t seem to get a consistent result while running XGBoost, even with the same parameters. For example, I get a .812 CV score from hyperopt, but I can’t reproduce that result when generating out-of-fold features (it jumps to .817). 
This never happens while using LightGBM.\n\nTherefore, I picked 2 models: one with max_depth tuned and one without max_depth tuned, to get out-of-fold features, hoping they are different enough for ensembling. \n\nFor the record, the first model results in a .812 CV score (in hyperopt) and a .926 LB score, and the second model results in a .813 CV score (hyperopt) and a .927 LB score. Either way, both are worse than the LGB model \n\n``` python \nimport numpy as np\nfrom hyperopt import hp\n\n# Search space; n_estimators and max_depth are commented out, i.e. not tuned here\nspace = {\n    # 'n_estimators': hp.quniform('n_estimators', 50, 500, 5),\n    # 'max_depth': hp.choice('max_depth', np.arange(5, 10, dtype=int)),\n    'subsample': hp.quniform('subsample', 0.7, 0.9, 0.05),\n    'colsample_bytree': hp.quniform('colsample_bytree', 0.7, 0.9, 0.05),\n    'gamma': hp.quniform('gamma', 0, 1, 0.05),\n    'max_leaf_nodes': hp.choice('max_leaf_nodes', np.arange(100, 140, dtype=int)),\n    'min_child_weight': hp.choice('min_child_weight', np.arange(100, 140, dtype=int)),\n    'learning_rate': 0.03,\n    'eval_metric': 'rmse',\n    'objective': 'reg:linear',  # deprecated alias of 'reg:squarederror' in newer XGBoost releases\n    'seed': 1204,\n    'tree_method': 'gpu_hist',\n}\n\n# optimize() is a helper that is not shown in this README\nbest_hyperparams = optimize(space, max_evals=200)\nprint(\"The best hyperparameters are: \")\nprint(best_hyperparams)\n```\n\nYou can find more information about this in the [XGB notebook](xgb_tuning.ipynb). Predictions for the model with max_depth tuned are named ‘tuned_xgb_basicfeatures_6folds_8126.csv’ and those for the other one are named ‘tuned_xgb_basicfeatures_6folds_8136’\n\n\n## VI. 
Ensembling\n\nWith the LightGBM, XGB model-1 and XGB model-2 out-of-fold features from the previous methods, I calculated pairwise differences between them, took the mean of all 3 (LGB, XGB1 and XGB2) out-of-fold features, and included the most important feature according to feature importance: ‘target_lag_1’.\n\nFrom here I tried a few ensembling methods:\n- Simple average and weighted average \n- sklearn linear regression and ElasticNet\n- Shallow Random Forest, tuned with 5 folds (from 29 to 33)\n\nAll of them result in an RMSE slightly higher than that of the best LightGBM model, so LightGBM still outperforms them.\n\n``` python\nimport pandas as pd\nfrom sklearn.linear_model import Ridge\n\n# get_X_y_ensembling, get_submission, all_oof_df and test_df are defined outside this snippet\nX, y = get_X_y_ensembling(all_oof_df)\n# alpha=0.0 disables the penalty, so this is plain least squares without an intercept\nparams = {'alpha': 0.0, 'fit_intercept': False, 'solver': 'sag', 'random_state': 1402}\nlr = Ridge(**params)\nlr.fit(X, y)\ntest_pred = lr.predict(test_df)\nprint(pd.Series(test_pred).describe())\nget_submission(test_pred, 'ensembling_ridge')\n```\n\nMore information can be found in the [Ensembling notebook](ensembling.ipynb)\n\n## VII. Improvement:\n\nA few things that can be improved:\n- Implement neural net WITHOUT categorical embedding\n- Generate more features related to holidays, such as the difference between the current month and the holiday month.\n- Translate item names to English and perform sentiment analysis on them\n- Use only a subset of those meta features for ensembling\n\n\n### Libraries Used\n\n![Ipynb](https://img.shields.io/badge/Python-datetime-blue.svg?style=flat\u0026logo=python\u0026logoColor=white) \n![Ipynb](https://img.shields.io/badge/Python-pandas-blue.svg?style=flat\u0026logo=python\u0026logoColor=white)\n![Ipynb](https://img.shields.io/badge/Python-numpy-blue.svg?style=flat\u0026logo=python\u0026logoColor=white) \n![Ipynb](https://img.shields.io/badge/Python-matplotlib-blue.svg?style=flat\u0026logo=python\u0026logoColor=white) 
\n![Ipynb](https://img.shields.io/badge/Python-seaborn-blue.svg?style=flat\u0026logo=python\u0026logoColor=white)\n![Ipynb](https://img.shields.io/badge/Python-scipy-blue.svg?style=flat\u0026logo=python\u0026logoColor=white) \n![Ipynb](https://img.shields.io/badge/Python-sklearn-blue.svg?style=flat\u0026logo=python\u0026logoColor=white) \n\n\n### Installation\n\nThe notebooks use the imports listed below. datetime, os and itertools are part of the Python standard library; the remaining packages can be installed with pip (sklearn is published on PyPI as scikit-learn).\n\n- **datetime**: `from datetime import datetime`\n- **pandas**: `import pandas as pd`\n- **numpy**: `import numpy as np`\n- **matplotlib**: `import matplotlib`\n- **matplotlib.pyplot**: `import matplotlib.pyplot as plt`\n- **seaborn**: `import seaborn as sns`\n- **os**: `import os`\n- **scipy**: `from scipy import sparse`\n- **scipy.sparse**: `from scipy.sparse import csr_matrix`\n- **sklearn.decomposition**: `from sklearn.decomposition import TruncatedSVD`\n- **sklearn.metrics.pairwise**: `from sklearn.metrics.pairwise import cosine_similarity`\n- **itertools**: `from itertools import product`\n\n\n### How to run?\n\n[![Ipynb](https://img.shields.io/badge/Prediction-Sales.Python-lightgrey.svg?logo=python\u0026style=social)](https://github.com/atharvapathak/Sales_Forecasting_Project)\n\n\n### Project Reports\n\n[![report](https://img.shields.io/static/v1.svg?label=Project\u0026message=Report\u0026logo=microsoft-word\u0026style=social)](https://github.com/atharvapathak/Sales_Forecasting_Project/)\n\n- [Download](https://github.com/atharvapathak/Sales_Forecasting_Project/) the report.\n\n \n### Related Work\n\n[![Sales 
Prediction](https://img.shields.io/static/v1.svg?label=Sales\u0026message=Prediction\u0026color=lightgray\u0026logo=python\u0026style=social\u0026colorA=critical)](https://www.linkedin.com/in/atharva-pathak-126021119/) [![GitHub top language](https://img.shields.io/github/languages/top/atharvapathak/Sales_Forecasting_Project.svg?logo=php\u0026style=social)](https://github.com/atharvapathak/)\n\n[Sales Prediction](https://github.com/atharvapathak/Sales_Forecasting_Project) - A Detailed Report on the Analysis\n\n\n### Contributing\n\n[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?logo=github)](https://github.com/atharvapathak/Sales_Forecasting_Project/pulls) [![GitHub issues](https://img.shields.io/github/issues/atharvapathak/Sales_Forecasting_Project?logo=github)](https://github.com/atharvapathak/Sales_Forecasting_Project/issues) ![GitHub pull requests](https://img.shields.io/github/issues-pr/atharvapathak/Sales_Forecasting_Project?color=blue\u0026logo=github) \n[![GitHub commit activity](https://img.shields.io/github/commit-activity/y/atharvapathak/Sales_Forecasting_Project?logo=github)](https://github.com/atharvapathak/Sales_Forecasting_Project/)\n\n- Clone [this](https://github.com/atharvapathak/Sales_Forecasting_Project/) repository: \n\n```bash\ngit clone https://github.com/atharvapathak/Sales_Forecasting_Project.git\n```\n\n- Check out any issue from [here](https://github.com/atharvapathak/Sales_Forecasting_Project/issues).\n\n- Make changes and send [Pull Request](https://github.com/atharvapathak/Sales_Forecasting_Project/pull).\n \n### Need help?\n\n [![LinkedIn](https://img.shields.io/static/v1.svg?label=connect\u0026message=@atharvapathak\u0026color=success\u0026logo=linkedin\u0026style=flat\u0026logoColor=white\u0026colorA=blue)](https://www.linkedin.com/in/atharva-pathak-126021119/)\n\n:email: Feel free to contact me @ 
[atharvapathakb2w@gmail.com](https://mail.google.com/mail/)\n\n[![GMAIL](https://img.shields.io/static/v1.svg?label=send\u0026message=atharvapathakb2w@gmail.com\u0026color=red\u0026logo=gmail\u0026style=social)](https://www.github.com/atharvapathak) [![Twitter Follow](https://img.shields.io/twitter/follow/pathak_atharva?style=social)](https://twitter.com/pathak_atharva)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fatharvapathak%2Fsales_forecasting_project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fatharvapathak%2Fsales_forecasting_project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fatharvapathak%2Fsales_forecasting_project/lists"}