Distributed ARIMA Models
- Host: GitHub
- URL: https://github.com/xqnwang/darima
- Owner: xqnwang
- Created: 2020-06-30T07:33:48.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2022-10-17T01:09:21.000Z (about 2 years ago)
- Last Synced: 2024-08-01T16:36:08.123Z (3 months ago)
- Topics: arima, distributed-computing, spark, time-series-forecasting
- Language: Python
- Homepage:
- Size: 5.8 MB
- Stars: 27
- Watchers: 7
- Forks: 24
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# `darima`
Distributed ARIMA Models implemented with Apache Spark
# Introduction
DARIMA is designed to facilitate forecasting ultra-long time series by utilizing the industry-standard MapReduce framework. The algorithm is developed on the Spark platform and provides both Python and R interfaces.
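For intuition, here is a toy sketch of the split/estimate/combine pattern that DARIMA builds on. It is not the `darima` API: a simple least-squares AR(1) fit stands in for the per-subseries ARIMA estimation, and a length-weighted average stands in for the global combination step described below.

```python
# Toy illustration only (not the darima implementation): split a long series
# into contiguous subseries, fit a local model on each partition (map), and
# combine the local estimates into a global one (reduce).
import numpy as np
from pyspark.sql import SparkSession

def fit_ar1(subseries):
    """Least-squares AR(1) fit; returns (coefficient, number of observations)."""
    y = np.asarray(subseries, dtype=float)
    x, z = y[:-1], y[1:]
    return float(np.dot(x, z) / np.dot(x, x)), len(y)

spark = SparkSession.builder.appName("darima-toy").getOrCreate()
sc = spark.sparkContext

# Simulate an AR(1) series with coefficient 0.7 and cut it into 8 subseries.
rng = np.random.RandomState(0)
n, phi_true = 20000, 0.7
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + rng.normal()
subseries = np.array_split(y, 8)

# Map: one local estimate per subseries; Reduce: length-weighted combination.
local = sc.parallelize([s.tolist() for s in subseries], 8).map(fit_ar1).collect()
phi_combined = sum(phi * w for phi, w in local) / sum(w for _, w in local)
print(f"combined AR(1) estimate: {phi_combined:.3f}")
spark.stop()
```

In the actual DARIMA mapper, each fitted ARIMA model is first converted to an AR representation so that all local estimators live in a common parameter space before they are combined.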
See the [`darima`](darima) directory for the functions used to implement DARIMA models.
- [`model.py`](darima/model.py) : Train ARIMA models for each subseries and convert the trained models into AR representations (Mapper).
- [`dlsa.py`](darima/dlsa.py) : Combine the local estimators obtained in Mapper by minimizing the global loss function (Reducer).
- [`forecast.py`](darima/forecast.py) : Forecast the next H observations by utilizing the combined estimators.
- [`evaluation.py`](darima/evaluation.py) : Calculate the forecasting accuracy in terms of MASE, sMAPE and MSIS (see the sketch after this list).
- [`R`](darima/R) : R functions designed for modeling, combining and forecasting. [rpy2](https://pypi.org/project/rpy2/) is needed as an interface to use R from Python.
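The three accuracy measures have standard definitions; the sketch below follows the usual formulations (as in the M4 competition), and details such as the scaling period `m` may differ from those used in `evaluation.py`.

```python
# Hedged reference definitions of sMAPE, MASE and MSIS (not taken from evaluation.py).
import numpy as np

def smape(y, yhat):
    """Symmetric MAPE, in percent."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 100.0 * np.mean(2.0 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))

def mase(y, yhat, y_train, m=1):
    """Mean absolute scaled error; scale is the in-sample seasonal-naive MAE."""
    y, yhat, y_train = (np.asarray(a, float) for a in (y, yhat, y_train))
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y - yhat)) / scale

def msis(y, lower, upper, y_train, m=1, alpha=0.05):
    """Mean scaled interval score of a (1 - alpha) prediction interval."""
    y, lower, upper, y_train = (np.asarray(a, float) for a in (y, lower, upper, y_train))
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    penalty = (upper - lower) \
        + (2.0 / alpha) * (lower - y) * (y < lower) \
        + (2.0 / alpha) * (y - upper) * (y > upper)
    return np.mean(penalty) / scale
```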
# System requirements
- `Spark >= 2.3.1`
- `Python >= 3.7.0`
- `pyspark >= 2.3.1`
- `rpy2 >= 3.0.4`
- `scikit-learn >= 0.21.2`
- `numpy >= 1.16.3`
- `pandas >= 0.23.4`
- `R >= 3.5.2`
- `forecast >= 8.5`
- `polynom >= 1.3.9`
- `dplyr >= 0.8.4`
- `quantmod >= 0.4.13`
- `magrittr >= 1.5`
# Usage
## DARIMA
Run the [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) code to forecast the GEFCom2017 time series with DARIMA:
```sh
./bash/run_darima.sh
```
or simply run
```sh
PYSPARK_PYTHON=/usr/local/bin/python3.7 ARROW_PRE_0_15_IPC_FORMAT=1 spark-submit ./run_darima.py
```
**Note**: `ARROW_PRE_0_15_IPC_FORMAT=1` instructs `PyArrow >= 0.15.0` to use the legacy IPC format expected by the older Arrow Java libraries shipped with Spark 2.3.x and 2.4.x.
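If the `SparkSession` is built in Python, the same flag can also be propagated through Spark configuration instead of the command-line prefix; a minimal sketch (the actual setup in `run_darima.py` may differ):

```python
# Sketch: set the legacy-Arrow flag for both the driver process and the executors.
import os
from pyspark.sql import SparkSession

os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"  # driver side

spark = (
    SparkSession.builder
    .appName("darima")
    .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")  # executor side
    .getOrCreate()
)
```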
## ARIMA
Run the R code to forecast the GEFCom2017 time series with the `auto.arima()` function (used for comparison):
```sh
./bash/auto_arima.sh
```
or simply run
```r
Rscript auto_arima.R
```
# References
- [Xiaoqian Wang](https://xqnwang.rbind.io), [Yanfei Kang](https://yanfei.site), [Rob J Hyndman](https://robjhyndman.com), & [Feng Li](https://feng.li/) (2022) Distributed ARIMA models for ultra-long time series. International Journal of Forecasting [DOI: 10.1016/j.ijforecast.2022.05.001](https://doi.org/10.1016/j.ijforecast.2022.05.001).