Distributed ARIMA Models
https://github.com/xqnwang/darima


Host: GitHub
URL: https://github.com/xqnwang/darima
Owner: xqnwang
Created: 2020-06-30T07:33:48.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2022-10-17T01:09:21.000Z (over 2 years ago)
Last Synced: 2024-08-01T16:36:08.123Z (6 months ago)
Topics: arima, distributed-computing, spark, time-series-forecasting
Language: Python
Homepage:
Size: 5.8 MB
Stars: 27
Watchers: 7
Forks: 24
Open Issues: 0
Metadata Files:
- Readme: README.md

# `darima`

Distributed ARIMA Models implemented with Apache Spark

# Introduction

DARIMA is designed to forecast ultra-long time series by using the industry-standard MapReduce framework. The algorithm is developed on the Spark platform, with both Python and R interfaces.

See [`darima`](darima) for the functions that implement DARIMA models.

- [`model.py`](darima/model.py) : Train ARIMA models for each subseries and convert the trained models into AR representations (Mapper).

- [`dlsa.py`](darima/dlsa.py) : Combine the local estimators obtained in Mapper by minimizing the global loss function (Reducer).

- [`forecast.py`](darima/forecast.py) : Forecast the next H observations by utilizing the combined estimators.

- [`evaluation.py`](darima/evaluation.py) : Calculate the forecasting accuracy in terms of MASE, sMAPE and MSIS.

- [`R`](darima/R) : R functions designed for modeling, combining and forecasting. [rpy2](https://pypi.org/project/rpy2/) is needed as an interface to use R from Python.
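
The Mapper/Reducer flow listed above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the code in `darima/`: a least-squares AR(p) fit stands in for the ARIMA fit plus AR conversion done in `model.py`, and a precision-weighted average stands in for the combination step in `dlsa.py`. All function names here (`split_series`, `fit_ar`, `combine`, `forecast_ar`) are hypothetical.

```python
import numpy as np


def split_series(y, n_parts):
    """Split a series into contiguous subseries, one per mapper."""
    return np.array_split(np.asarray(y, dtype=float), n_parts)


def fit_ar(sub, p):
    """Mapper stand-in: least-squares AR(p) fit with intercept.

    Returns the estimate [c, phi_1, ..., phi_p] and its estimated covariance.
    """
    n = len(sub)
    # Column k holds the lag-k values aligned with the targets sub[p:].
    lags = [sub[p - k:n - k] for k in range(1, p + 1)]
    X = np.column_stack([np.ones(n - p)] + lags)
    target = sub[p:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    sigma2 = resid @ resid / (len(target) - p - 1)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, cov


def combine(estimates):
    """Reducer stand-in: precision-weighted average of the local estimates.

    Minimizes sum_k (b - b_k)' W_k (b - b_k) with W_k = inv(cov_k).
    """
    weights = [np.linalg.inv(cov) for _, cov in estimates]
    total = sum(weights)
    rhs = sum(W @ b for W, (b, _) in zip(weights, estimates))
    return np.linalg.solve(total, rhs)


def forecast_ar(y, beta, h):
    """Recursive h-step-ahead point forecasts from the combined AR estimate."""
    p = len(beta) - 1
    history = list(y[-p:])
    out = []
    for _ in range(h):
        recent_first = history[::-1][:p]
        nxt = beta[0] + np.dot(beta[1:], recent_first)
        out.append(nxt)
        history.append(nxt)
    return np.array(out)


# Simulated AR(2) data (illustrative only, not the GEFCom2017 series).
rng = np.random.default_rng(0)
n = 4000
y = np.zeros(n)
for t in range(2, n):
    y[t] = 1.0 + 0.6 * y[t - 1] - 0.2 * y[t - 2] + rng.normal()

local = [fit_ar(sub, p=2) for sub in split_series(y, 4)]  # map step
theta = combine(local)                                    # reduce step
point_forecasts = forecast_ar(y, theta, h=12)
```

Here a list comprehension plays the role of the map step; in the package itself that work is distributed with Spark.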

# System requirements

- `Spark >= 2.3.1`

- `Python >= 3.7.0`

    - `pyspark >= 2.3.1`

    - `rpy2 >= 3.0.4`

    - `scikit-learn >= 0.21.2`

    - `numpy >= 1.16.3`

    - `pandas >= 0.23.4`

- `R >= 3.5.2`

    - `forecast >= 8.5`

    - `polynom >= 1.3.9`

    - `dplyr >= 0.8.4`

    - `quantmod >= 0.4.13`

    - `magrittr >= 1.5`
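
The Python-side minimums above can be checked before submitting a job. The following is a rough sketch, assuming Python 3.8 or later (where `importlib.metadata` is in the standard library) and purely numeric version strings; `MINIMUMS`, `as_tuple` and `check_requirements` are hypothetical names, and the version comparison is deliberately simplified.

```python
from importlib import metadata

# Minimum versions copied from the requirements list above.
MINIMUMS = {
    "pyspark": "2.3.1",
    "rpy2": "3.0.4",
    "scikit-learn": "0.21.2",
    "numpy": "1.16.3",
    "pandas": "0.23.4",
}


def as_tuple(version):
    """Turn '1.16.3' into (1, 16, 3), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_requirements(minimums=MINIMUMS):
    """Return {package: problem} for packages that are missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if as_tuple(installed) < as_tuple(minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems
```

An empty dictionary from `check_requirements()` means every listed Python package meets its minimum. The R packages are not covered by this sketch.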

# Usage

## DARIMA

Run the [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) code to forecast the GEFCom2017 time series with DARIMA.

```sh

  ./bash/run_darima.sh

```

or simply run

```sh

  PYSPARK_PYTHON=/usr/local/bin/python3.7 ARROW_PRE_0_15_IPC_FORMAT=1 spark-submit ./run_darima.py

```

**Note**: `ARROW_PRE_0_15_IPC_FORMAT=1` instructs `PyArrow >= 0.15.0` to use the legacy IPC format, which is needed for compatibility with the older Arrow Java version bundled with Spark 2.3.x and 2.4.x.
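
The rule in the note can be written down as a small helper that decides when the flag is needed and builds the environment for `spark-submit`. This is a sketch only: `needs_legacy_arrow_ipc` and `spark_submit_env` are hypothetical names, and the version strings are assumed to start with numeric `major.minor` components.

```python
import os


def needs_legacy_arrow_ipc(pyarrow_version, spark_version):
    """True when PyArrow >= 0.15 is paired with Spark 2.3.x or 2.4.x."""
    pa = tuple(int(x) for x in pyarrow_version.split(".")[:2])
    sp = tuple(int(x) for x in spark_version.split(".")[:2])
    return pa >= (0, 15) and sp in {(2, 3), (2, 4)}


def spark_submit_env(pyarrow_version, spark_version, python_path, base_env=None):
    """Build the environment variables for launching spark-submit."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = python_path
    if needs_legacy_arrow_ipc(pyarrow_version, spark_version):
        env["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
    return env
```

The resulting dictionary can be passed as `env=` to `subprocess.run(["spark-submit", "./run_darima.py"], env=env)`. On a cluster the variable may also need to reach the executors, for example through `conf/spark-env.sh`.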

## ARIMA

Run the R code to forecast the GEFCom2017 time series with the `auto.arima()` function, which serves as the comparison baseline.

```sh

  ./bash/auto_arima.sh

```

or simply run

```sh

  Rscript auto_arima.R

```

# References

- [Xiaoqian Wang](https://xqnwang.rbind.io), [Yanfei Kang](https://yanfei.site), [Rob J Hyndman](https://robjhyndman.com), & [Feng Li](https://feng.li/) (2022) Distributed ARIMA models for ultra-long time series. International Journal of Forecasting [DOI: 10.1016/j.ijforecast.2022.05.001](https://doi.org/10.1016/j.ijforecast.2022.05.001).