https://github.com/sushmitha-93/predict-future-sales



# Predict-Future-Sales

Sales forecasting is a frequent application of machine learning. Businesses can use forecasts to identify benchmarks, determine the incremental impact of new initiatives, plan resources in response to expected demand, and project future budgets. This report provides detailed machine-learning-based solutions to the Kaggle competition Predict Future Sales. I implemented several machine learning models to forecast sales for the given dataset and report the best results together with their Kaggle ranking.

Problem Statement and Methods applied


The task is to forecast the total number of products sold in every shop for the test set. We applied the following models:


  1. Prophet

  2. LSTM

  3. ARIMA

  4. LightGBM

  5. XGBoost

Check out the detailed report of each implemented model here.

Data Description


The dataset is time-series data of daily historical sales. It covers 22,170 items sold by 60 shops between January 2013 and October 2015.


The dataset consists of 6 CSV files, described below:




  1. sales_train.csv: This is the training set, which consists of daily historical data from January 2013 to October 2015.


  2. test.csv: This is the test set. We are expected to forecast the sales for these given shops and products for November 2015.


  3. sample_submission.csv: A sample submission file in the expected format.


  4. items.csv: supplemental information about the items.


  5. item_categories.csv: supplemental information about the item categories.


  6. shops.csv: supplemental information about the shops.



The data fields present in these files are described below:


  1. ID - an Id that represents a (Shop, Item) tuple within the test set


  2. shop_id - unique identifier of a shop


  3. item_id - unique identifier of a product


  4. item_category_id - unique identifier of item category


  5. item_cnt_day - number of products sold. You are predicting a monthly amount of this measure


  6. item_price - current price of an item


  7. date - date in format dd/mm/yyyy


  8. date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1, ..., October 2015 is 33


  9. item_name - name of item


  10. shop_name - name of shop


  11. item_category_name - name of item category
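
Because the target is a monthly amount of `item_cnt_day`, the daily rows have to be summed per shop, item and month before modelling. The following is a minimal sketch with pandas, not the report's actual code: the toy rows are made up, and the column name `item_cnt_month` and the helper `to_monthly` are assumptions introduced here for illustration.

```python
import pandas as pd

# Toy daily rows using the field names listed above (all values are made up).
daily = pd.DataFrame({
    "date": ["02/01/2013", "15/01/2013", "03/02/2013", "05/01/2013"],
    "date_block_num": [0, 0, 1, 0],
    "shop_id": [59, 59, 59, 25],
    "item_id": [22154, 22154, 22154, 2552],
    "item_price": [999.0, 999.0, 999.0, 899.0],
    "item_cnt_day": [1, 2, 1, 1],
})


def to_monthly(df: pd.DataFrame) -> pd.DataFrame:
    """Sum daily counts into one row per (month, shop, item)."""
    return (
        df.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)
          .agg(item_cnt_month=("item_cnt_day", "sum"))  # hypothetical column name
    )


monthly = to_monthly(daily)
print(monthly)
```

Grouping on `date_block_num` rather than on the parsed `date` avoids any date-format handling, since the month index is already provided.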


The dataset can be downloaded from here.

Objective


The competition requires us to predict the total sales in the next month (November 2015) for every item and shop pair in the test file.

Submissions are evaluated by root mean squared error (RMSE), with the true target values clipped into the [0, 20] range.
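
As a rough illustration of this metric, the sketch below computes RMSE after clipping with NumPy. The function name `clipped_rmse` is hypothetical. Clipping the predictions as well as the true targets is an assumption made here (predictions outside [0, 20] can only increase the error against clipped targets); the competition statement only mentions clipping the true values.

```python
import numpy as np


def clipped_rmse(y_true, y_pred, low=0.0, high=20.0):
    """RMSE with targets clipped into [low, high].

    Predictions are clipped too, which is an assumption of this sketch.
    """
    y_true = np.clip(np.asarray(y_true, dtype=float), low, high)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), low, high)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up monthly counts: the third pair is clipped to 20 on both sides.
score = clipped_rmse([0, 3, 25], [1, 3, 30])
print(round(score, 4))
```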

Literature Review


Why predict future sales?


Sales forecasting is a technique that uses historical sales data as inputs to make informed predictions about the direction of future trends.



  • Manage the supply chain efficiently: Knowing future consumer trends allows a business's sales operations to align supply chain activities such as material purchases, inventory stocking, warehouse capacity planning, and hiring, and to meet market demand with improved decision making.


  • Generate higher revenue: It enables companies to focus their sales team on high-profit sales opportunities, resulting in higher revenue.


  • Incorporate the right changes: It enables companies to make the right changes in pricing, marketing, products, locations, hiring, etc. for improved business outcomes.


Different Sales Forecasting Techniques:



  1. Qualitative Methods:



    • Market Research: A systematic process of actively surveying or interviewing potential customers to determine their interest in a service or product.


    • Delphi Method: A panel of experts is interviewed through a sequence of questionnaires, giving the forecaster the information needed for forecasting.


    • Visionary Forecast: A non-scientific method in which a 'visionary' or 'futurologist' attempts to forecast through subjective opinion, guesswork and imagination.


  2. Time Series Analysis and Projection:



    • Moving Average: A technical indicator that investors and traders use to determine trend direction and smooth out seasonal irregularities. It is calculated by adding up the data points over a specific period and dividing by the number of time periods.


    • Exponential smoothing: This is similar to a moving average, except that more recent data points are given more weight. It is applied mainly in production and inventory control.


    • Trend Projection: This technique fits a trend line to the data using a mathematical equation and then projects it into the future by means of this equation. It is typically used to forecast new products and long-term sales.


  3. Causal methods:



    • Regression model: This functionally relates sales to other economic or internal variables and estimates the equation using the least-squares technique. It is good for short-term predictions.


    • Life-cycle Analysis: The product acceptance by various groups is analysed to forecast product growth rates.
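
The three time-series techniques listed above can be sketched in a few lines of pandas and NumPy. This is only an illustration on made-up monthly totals; the window size, the smoothing factor `alpha` and the straight-line trend are assumptions chosen for the example, not settings taken from this project.

```python
import numpy as np
import pandas as pd

# Made-up monthly sales totals.
sales = pd.Series([10.0, 12.0, 13.0, 12.0, 15.0, 16.0])

# Moving average: mean of the last `window` observations.
window = 3
moving_avg = sales.rolling(window=window).mean()

# Exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1},
# so recent points carry more weight than older ones.
alpha = 0.5
exp_smooth = sales.ewm(alpha=alpha, adjust=False).mean()

# Trend projection: fit a straight line by least squares
# and extend it one step past the last observation.
t = np.arange(len(sales))
slope, intercept = np.polyfit(t, sales.to_numpy(), deg=1)
next_value = slope * len(sales) + intercept

print(moving_avg.iloc[-1], exp_smooth.iloc[-1], next_value)
```

The first `window - 1` entries of `moving_avg` are `NaN` because a full window is not yet available at those positions.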



Data Exploration and Data Pre-processing:

Data Exploration:


The very first step in data analysis is to explore and visualise the raw data to uncover patterns, characteristics, and points of interest. This gives a broader picture of important trends and of points that require further study. It also gives us an idea of how much cleaning the data requires. Given below are some glimpses of the dataset.


Since the original dataset is in Russian, it was translated to English using translation tools in order to better understand the data and to see whether there is any scope for feature extraction.




From the exploration we can conclude that we only need to forecast sales for 5,100 items across 42 shops. Hence, we can leave out the remaining shops and items to reduce the computing resources required to train the models.
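
One way to apply this reduction is to keep only the training rows whose shop and item both appear in the test set. The sketch below shows the idea with pandas on toy frames; the helper name `restrict_to_test` and all values are hypothetical, and the report's own filtering may differ.

```python
import pandas as pd

# Toy stand-ins for sales_train.csv and test.csv (values are made up).
train = pd.DataFrame({
    "shop_id": [1, 2, 2, 3],
    "item_id": [10, 10, 11, 12],
    "item_cnt_day": [1, 2, 1, 4],
})
test = pd.DataFrame({"ID": [0, 1], "shop_id": [2, 2], "item_id": [10, 11]})


def restrict_to_test(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Keep training rows whose shop_id and item_id both occur in the test set."""
    keep_shop = train["shop_id"].isin(test["shop_id"].unique())
    keep_item = train["item_id"].isin(test["item_id"].unique())
    return train[keep_shop & keep_item].reset_index(drop=True)


train_small = restrict_to_test(train, test)
print(train_small)
```

Note that this drops the sales history of items that are absent from the test set, which saves memory but also removes information that category-level features could otherwise use.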

Visualizing Data


For data visualization, we mainly used the Python libraries Plotly, Seaborn and Matplotlib.

1. Distribution of items in each category: We can see the distribution of items among 83 categories. Item category 40 has the highest number of items.

2. Total sales made by each shop over the span of 34 months.

3. The plot below shows the total number of items sold by each shop per day over the span of 34 months. We can see that shop no. 9 opens sporadically and makes huge sales when it is open. We also see peaks in year-end sales.

4. Plot of the price of each item: We can see that most items are in roughly the same price range, except for one item (6606) whose price is far higher. This is clearly an outlier.

5. The plot below shows the total sales and the number of items sold per month: We can see that there is seasonality in the sales trend. Sales peak at the end of the year and then follow a decreasing trend.

Data cleaning:


Data in its raw form is not directly usable. It needs to be cleaned and put into a form that is more readable and usable. The practice of modifying data to make it more understandable and structured is known as data manipulation. It improves the quality of the data for later modelling. The following are some of the steps performed to clean the data.

From the plots above, we can see that there are clearly some outliers that need to be treated.

An outlier is an observation or value that lies an abnormal distance from the other values in a given sample. Outliers are stragglers that can be extremely high or extremely low values. They can reflect variability in the measurement or can sometimes indicate an error during the experiment. Outliers can lead to misleading interpretations, so it is usually advisable to remove them before training a model.

Detecting outliers using box plot.


1. Box plot of the item_price feature: We see that one particular item has a price above 300k, far from the rest of the sample.

2. Box plot of the item_cnt_day feature: We see that there is one sample with an item count of more than 2000.

We simply remove these outlier samples, as they can skew the training considerably.


3. Handling negative values: We see that the training data has samples with a negative item_price and a negative item count.


Since item price and count are not fixed and vary from month to month, we handled them by setting the negative values to null and imputing them with Scikit-learn’s KNNImputer, which uses the K-Nearest Neighbours algorithm to fill each null value from the values of its closest neighbouring samples.


4. Handling null values: This dataset has no null values.
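
The cleaning steps above can be sketched as follows. This is a hedged illustration on made-up rows, not the report's actual code: the thresholds (300,000 for `item_price`, 2,000 for `item_cnt_day`) are read off the box plot descriptions, and `n_neighbors=2` together with the choice of columns passed to `KNNImputer` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy training rows (values are made up): two extreme outliers, two negatives.
train = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1, 2, 2],
    "shop_id": [5, 5, 5, 6, 6, 6],
    "item_id": [100, 101, 100, 101, 100, 101],
    "item_price": [500.0, 700.0, -1.0, 650.0, 520.0, 307980.0],
    "item_cnt_day": [1.0, 2.0, 1.0, -1.0, 2169.0, 1.0],
})

# 1. Drop the extreme outliers (assumed thresholds, see lead-in).
clean = train[(train["item_price"] < 300_000) & (train["item_cnt_day"] <= 2000)].copy()

# 2. Turn negative prices and counts into missing values.
for col in ["item_price", "item_cnt_day"]:
    clean.loc[clean[col] < 0, col] = np.nan

# 3. Impute each missing value with the mean of its nearest neighbours,
#    where distance is measured on the columns that are present.
imputer = KNNImputer(n_neighbors=2)
clean = pd.DataFrame(
    imputer.fit_transform(clean), columns=clean.columns, index=clean.index
)

# 4. Confirm that no null values remain.
print(clean.isna().sum())
```

`KNNImputer` returns floating-point values for every column, so identifier columns such as `shop_id` would need to be cast back to integers before further use.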

