Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/canerturkseven/forecastflowml
🧙 Scalable machine learning forecasting framework with Pyspark
https://github.com/canerturkseven/forecastflowml
forecasting lightgbm machine-learning python spark time-series xgboost
Last synced: 3 months ago
JSON representation
🧙 Scalable machine learning forecasting framework with Pyspark
- Host: GitHub
- URL: https://github.com/canerturkseven/forecastflowml
- Owner: canerturkseven
- License: mit
- Created: 2023-01-18T11:44:41.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2023-06-06T19:51:04.000Z (over 1 year ago)
- Last Synced: 2024-10-15T20:16:59.720Z (3 months ago)
- Topics: forecasting, lightgbm, machine-learning, python, spark, time-series, xgboost
- Language: Python
- Homepage: https://forecastflowml.readthedocs.io/en/latest
- Size: 152 MB
- Stars: 4
- Watchers: 3
- Forks: 1
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## ForecastFlowML: Scalable Machine Learning Forecasting with PySpark
[![Python Versions](https://img.shields.io/badge/python-3.7%20|%203.8%20|%203.9%20|%203.10%20|%203.11%20-blue)](https://www.python.org/downloads/) ![Tests](https://github.com/canerturkseven/ForecastFlowML/actions/workflows/tests.yml/badge.svg) [![codecov](https://codecov.io/github/canerturkseven/ForecastFlowML/branch/master/graph/badge.svg?token=DKAE8VSQ1M)](https://codecov.io/github/canerturkseven/ForecastFlowML) [![Documentation Status](https://readthedocs.org/projects/forecastflowml/badge/?version=latest)](https://forecastflowml.readthedocs.io/en/latest/?badge=latest)
ForecastFlowML is a scalable machine learning forecasting framework that enables parallel training (by distributing models rather than data) of scikit-learn like models based on PySpark.
With ForecastFlowML, you can build scikit-learn like regressors as direct multi-step forecasters, and train a seperate model for each group in your dataset.
Our package leverages the power of PySpark to efficiently handle large datasets and enables distributed computing for faster model training.## Features
ForecastFlowML provides a range of features that make it a powerful and flexible tool for time-series forecasting, including:
- Scaleable and extensive time series feature engineering (lag, rolling mean/std, stockout, history length) with PySpark.
- Parallel model training per group in the dataset with Pyspark Pandas UDFs.
- Direct multi-step forecasting.
- Built-in time based cross-validation.
- Hyperparameter tuning for each group model with grid search.
- Supports `scikit-learn` like libraries such as `LightGBM` or `XGBoost`.## Documentation
Reach out to our latest documentation [here](https://forecastflowml.readthedocs.io/en/latest/).
### User Guides
[What is ForecastFlowML?](https://forecastflowml.readthedocs.io/en/latest/forecastflowml.html)
[Feature Engineering](https://forecastflowml.readthedocs.io/en/latest/notebooks/feature_engineering.html)
[Time Series Cross Validation](https://forecastflowml.readthedocs.io/en/latest/notebooks/cross_validation.html)
[Grid Search](https://forecastflowml.readthedocs.io/en/latest/notebooks/grid_search.html)
[Feature Importance](https://forecastflowml.readthedocs.io/en/latest/notebooks/feature_importance.html)
[Save/Load ForecastFlowML](https://forecastflowml.readthedocs.io/en/latest/notebooks/save_load.html)
### Examples
[Kaggle Walmart M5 Forecasting Competition (18th solution)](https://www.kaggle.com/code/canerturkseven/forecastflowml-m5-forecasting-accuracy)
[Retail Demand Forecasting](https://forecastflowml.readthedocs.io/en/latest/notebooks/retail_demand_forecasting.html)
## Installation
### ForecastFlowML installation
You can install the package using the following command:
```
pip install forecastflowml
```#### Check Java
Make sure you have installed Java 11. You can check whether you have Java or not with the following command:
```
java -version
```#### Set PYSPARK_PYTHON
In the python script, set PYSPARK_PYTHON environment variable to your Python executable path before creating the spark instance:
```
import sys
import os
from pyspark.sql import SparkSession
os.environ["PYSPARK_PYTHON"] = sys.executable
spark = SparkSession.builder.master("local[*]").getOrCreate()
```