Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
https://github.com/pangeo-data/WeatherBench

A benchmark dataset for data-driven weather forecasting
benchmark dataset deep-learning weather-forecast
Host: GitHub
URL: https://github.com/pangeo-data/WeatherBench
Owner: pangeo-data
License: mit
Created: 2019-09-17T08:49:33.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2023-12-08T09:54:39.000Z (7 months ago)
Last Synced: 2024-06-05T19:21:40.443Z (25 days ago)
Topics: benchmark, dataset, deep-learning, weather-forecast
Language: Jupyter Notebook
Homepage:
Size: 17.4 MB
Stars: 668
Watchers: 31
Forks: 166
Open Issues: 17
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists

awesome-earth-system-ml - **WeatherBench: A benchmark dataset for data-driven weather forecasting**
Awesome-TimeSeries-SpatioTemporal-LM-LLM - [link]
awesome-WeatherAI - WeatherBench
README

![Logo](https://github.com/ai4environment/WeatherBench/blob/master/figures/logo_text_left.png?raw=true)

# WeatherBench: A benchmark dataset for data-driven weather forecasting

**🚨🚨🚨 [WeatherBench 2](https://github.com/google-research/weatherbench2) has been released. It provides an updated and much improved benchmark including more comprehensive and more easily accessible datasets.🚨🚨🚨**

[![Binder](https://binder.pangeo.io/badge_logo.svg)](https://binder.pangeo.io/v2/gh/pangeo-data/WeatherBench/master?filepath=quickstart.ipynb)

If you use this dataset, please cite:

> Stephan Rasp, Peter D. Dueben, Sebastian Scher, Jonathan A. Weyn, Soukayna Mouatadid, and Nils Thuerey, 2020.

> WeatherBench: A benchmark dataset for data-driven weather forecasting.

> arXiv: [https://arxiv.org/abs/2002.00469](https://arxiv.org/abs/2002.00469)

This repository contains all the code for downloading and processing the data, as well as the code for the baseline models in the paper.

 

 ---

*Note! The data has been changed from the original release. Here is a list of changes:*

- *New vertical levels. Used to be [1, 10, 100, 200, 300, 400, 500, 600, 700, 850, 1000], now is [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]. This is to be compatible with CMIP output. The new levels include all of the old ones with the exception of [1, 10].*
- *CMIP data. Regridded CMIP data of some variables was added. This is the historical simulation of the MPI-ESM-HR model.*

 ---
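As a quick check of the level change described in the note above, the following minimal sketch compares the two level lists as they are quoted there. The variable names are illustrative and not part of the repository.

```python
# Pressure levels (hPa) exactly as listed in the note above.
OLD_LEVELS = [1, 10, 100, 200, 300, 400, 500, 600, 700, 850, 1000]
NEW_LEVELS = [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]

# Levels that existed in the original release but are gone now.
dropped = sorted(set(OLD_LEVELS) - set(NEW_LEVELS))
# Levels that only exist in the updated release.
added = sorted(set(NEW_LEVELS) - set(OLD_LEVELS))

print("dropped:", dropped)
print("added:", added)
```

Code written against the original release that selects the 1 hPa or 10 hPa level will need to be adapted.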

 

If you have any questions about this dataset, please use the [GitHub Issues](https://github.com/pangeo-data/WeatherBench/issues) feature on this page!

## Leaderboard

| Model | Z500 RMSE (3 / 5 days) [m²/s²] | T850 RMSE (3 / 5 days) [K] | Notes | Reference |
|--------------------|----------------------------------|----------------------------|----------------------|------------------|
| Operational IFS | 154 / 334 | 1.36 / 2.03 | ECMWF physical model (10 km) | [Rasp et al. 2020](https://arxiv.org/abs/2002.00469) |
| Rasp and Thuerey 2020 (direct/continuous) | **268 / 499** | **1.65 / 2.41** | Resnet with CMIP pretraining (5.625 deg) | [Rasp and Thuerey 2020](http://arxiv.org/abs/2008.08626) |
| IFS T63 | 268 / 463 | 1.85 / 2.52 | Lower resolution physical model (approx. 1.9 deg) | [Rasp et al. 2020](https://arxiv.org/abs/2002.00469) |
| Weyn et al. 2020 (iterative) | **373 / 611** | **1.98 / 2.87** | UNet with cube-sphere mapping (2 deg) | [Weyn et al. 2020](https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2020MS002109) |
| Clare et al. 2021 (direct) | 375 / 627 | 2.11 / 2.91 | Stacked ResNets with probabilistic output (5.625 deg) | [Clare et al. 2021](https://rmets.onlinelibrary.wiley.com/doi/full/10.1002/qj.4180) |
| IFS T42 | 489 / 743 | 3.09 / 3.83 | Lower resolution physical model (approx. 2.8 deg) | [Rasp et al. 2020](https://arxiv.org/abs/2002.00469) |
| Weekly climatology | 816 | 3.50 | Climatology for each calendar week | [Rasp et al. 2020](https://arxiv.org/abs/2002.00469) |
| Persistence | 936 / 1033 | 4.23 / 4.56 |  | [Rasp et al. 2020](https://arxiv.org/abs/2002.00469) |
| Climatology | 1075 | 5.51 |  | [Rasp et al. 2020](https://arxiv.org/abs/2002.00469) |

## Quick start

You can follow the quickstart guide in [this notebook](https://github.com/pangeo-data/WeatherBench/blob/master/quickstart.ipynb) or launch it directly from [Binder](https://binder.pangeo.io/v2/gh/pangeo-data/WeatherBench/master?filepath=quickstart.ipynb).

## Download the data

The data is hosted [here](https://mediatum.ub.tum.de/1524895) with the following directory structure

```
.
|-- 1.40625deg
|   |-- 10m_u_component_of_wind
|   |-- 10m_v_component_of_wind
|   |-- 2m_temperature
|   |-- constants
|   |-- geopotential
|   |-- old
|   |   `-- temperature
|   |-- potential_vorticity
|   |-- relative_humidity
|   |-- specific_humidity
|   |-- temperature
|   |-- toa_incident_solar_radiation
|   |-- total_cloud_cover
|   |-- total_precipitation
|   |-- u_component_of_wind
|   |-- v_component_of_wind
|   `-- vorticity
|-- 2.8125deg
|   |-- 10m_u_component_of_wind
|   |-- 10m_v_component_of_wind
|   |-- 2m_temperature
|   |-- constants
|   |-- geopotential
|   |-- potential_vorticity
|   |-- relative_humidity
|   |-- specific_humidity
|   |-- temperature
|   |-- toa_incident_solar_radiation
|   |-- total_cloud_cover
|   |-- total_precipitation
|   |-- u_component_of_wind
|   |-- v_component_of_wind
|   `-- vorticity
|-- 5.625deg
|   |-- 10m_u_component_of_wind
|   |-- 10m_v_component_of_wind
|   |-- 2m_temperature
|   |-- constants
|   |-- geopotential
|   |-- geopotential_500
|   |-- potential_vorticity
|   |-- relative_humidity
|   |-- specific_humidity
|   |-- temperature
|   |-- temperature_850
|   |-- toa_incident_solar_radiation
|   |-- total_cloud_cover
|   |-- total_precipitation
|   |-- u_component_of_wind
|   |-- v_component_of_wind
|   `-- vorticity
|-- baselines
|   `-- saved_models
|-- CMIP
|   `-- MPI-ESM
|       |-- 2.8125deg
|       |   |-- geopotential
|       |   |-- specific_humidity
|       |   |-- temperature
|       |   |-- u_component_of_wind
|       |   `-- v_component_of_wind
|       `-- 5.625deg
|           |-- geopotential
|           |-- specific_humidity
|           |-- temperature
|           |-- u_component_of_wind
|           `-- v_component_of_wind
|-- IFS_T42
|   `-- raw
|-- IFS_T63
|   `-- raw
`-- tigge
    |-- 1.40625deg
    |   |-- geopotential_500
    |   `-- temperature_850
    |-- 2.8125deg
    |   |-- geopotential_500
    |   `-- temperature_850
    `-- 5.625deg
        |-- 2m_temperature
        |-- geopotential_500
        |-- temperature_850
        `-- total_precipitation
```

To get started, download either the entire 5.625 degree data (175G) using

```shell
wget "https://dataserv.ub.tum.de/s/m1524895/download?path=%2F5.625deg&files=all_5.625deg.zip" -O all_5.625deg.zip
```

or simply the single level (500 hPa) geopotential data using

```shell
wget "https://dataserv.ub.tum.de/s/m1524895/download?path=%2F5.625deg%2Fgeopotential_500&files=geopotential_500_5.625deg.zip" -O geopotential_500_5.625deg.zip
```

and then unzip the files using `unzip <filename>.zip`. You can also use `ftp` or `rsync` to download the data. For instructions, follow the [download link](https://mediatum.ub.tum.de/1524895).
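The two `wget` commands above follow a common URL pattern, which the Python sketch below reproduces. The helper names `build_url` and `download_and_unzip` are hypothetical, and it is an assumption that other variables and resolutions follow the same `<variable>_<resolution>.zip` naming as the two documented examples.

```python
import zipfile
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve

BASE_URL = "https://dataserv.ub.tum.de/s/m1524895/download"


def build_url(resolution, variable=None):
    """Return (url, zip_name) for one variable, or for the whole resolution."""
    if variable is None:
        path, zip_name = f"/{resolution}", f"all_{resolution}.zip"
    else:
        # Assumed naming pattern, inferred from the geopotential_500 example.
        path, zip_name = f"/{resolution}/{variable}", f"{variable}_{resolution}.zip"
    # safe="" also percent-encodes "/" as %2F, matching the wget URLs.
    return f"{BASE_URL}?path={quote(path, safe='')}&files={zip_name}", zip_name


def download_and_unzip(resolution, variable=None, out_dir="."):
    """Download one archive and extract it into out_dir."""
    url, zip_name = build_url(resolution, variable)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    zip_path = out / zip_name
    urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
    return zip_path


# Example (not executed here because it starts a large download):
# download_and_unzip("5.625deg", "geopotential_500", out_dir="data")
```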

## Baselines and evaluation

**IMPORTANT:** The format of the predictions file is a NetCDF dataset with dimensions `[init_time, lead_time, lat, lon]`. Consult the notebooks for examples. You are strongly encouraged to format your predictions in the same way and then use the same evaluation functions to ensure consistent evaluation.
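To make the `[init_time, lead_time, lat, lon]` layout concrete, here is a minimal NumPy sketch of an array with that shape and the valid time each forecast refers to. The dates, the lead times in hours, and the 32 x 64 grid are illustrative assumptions; the notebooks remain the reference for the actual file format.

```python
import numpy as np

# Two forecast initialisation times (illustrative dates).
init_time = np.array(["2017-01-01T00", "2017-01-01T12"], dtype="datetime64[h]")
# Lead times of 3 and 5 days, expressed in hours (assumed unit).
lead_time = np.array([72, 120])
# Assumed cell-centre coordinates of a 5.625 degree grid.
lat = np.linspace(-87.1875, 87.1875, 32)
lon = np.arange(0, 360, 5.625)

# One value per (init_time, lead_time, lat, lon) combination.
pred = np.zeros((len(init_time), len(lead_time), len(lat), len(lon)), dtype="float32")

# The time each forecast is valid for is init_time + lead_time.
valid_time = init_time[:, None] + lead_time[None, :].astype("timedelta64[h]")

print(pred.shape)
print(valid_time)
```

Keeping `init_time` and `lead_time` as separate dimensions lets forecasts with different lead times be compared against the same verification time.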

### Baselines

The baselines are created using Jupyter notebooks in `notebooks/`. In all notebooks, the forecasts are saved as a NetCDF file in the `predictions` directory of the dataset.

 

### CNN baselines

An example of how to load the data and train a CNN using Keras is given in `notebooks/3-cnn-example.ipynb`. In addition, a command-line script for training CNNs is provided in `src/train_nn.py`. For the baseline CNNs in the paper, the config files are given in `src/nn_configs/`. To reproduce the results in the paper, run e.g. `python -m src.train_nn -c src/nn_configs/fccnn_3d.yml`.

  

### Evaluation

Evaluation and comparison of the different baselines is done in `notebooks/4-evaluation.ipynb`. The scoring is done using the functions in `src/score.py`. The RMSE values for the baseline models are also saved in the `predictions` directory of the dataset. This is useful for plotting your own models alongside the baselines.
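The WeatherBench paper defines the RMSE with a latitude weighting, so that the many small grid cells near the poles do not dominate the score. The NumPy sketch below illustrates that idea under stated assumptions: it is not the code in `src/score.py`, and the function name `weighted_rmse` and the `(time, lat, lon)` array layout are chosen here for illustration.

```python
import numpy as np


def weighted_rmse(forecast, truth, lat_deg):
    """Latitude-weighted RMSE for arrays of shape (time, lat, lon)."""
    # Weight each latitude row by cos(lat), normalised to a mean of one.
    weights = np.cos(np.deg2rad(lat_deg))
    weights = weights / weights.mean()
    # Weighted squared error, broadcast over time and longitude.
    sq_err = (forecast - truth) ** 2 * weights[None, :, None]
    # RMSE per forecast time, then averaged over all forecasts.
    per_time = np.sqrt(sq_err.mean(axis=(1, 2)))
    return float(per_time.mean())


# Toy example: a forecast that is off by 2 units everywhere.
lat = np.linspace(-87.1875, 87.1875, 32)
truth = np.zeros((4, 32, 64))
forecast = truth + 2.0
print(weighted_rmse(forecast, truth, lat))
```

For a spatially uniform error the weighting has no effect, which makes this a convenient sanity check; for real comparisons with the leaderboard, use the repository's own scoring functions.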

## Data processing

The dataset already contains the most important processed data. If you would like to download a different variable, regrid to a different resolution or extract single levels from the 3D files, here is how to do that!

### Downloading and processing the raw data from the ERA5 archive

The workflow to get to the processed data that ended up in the data repository above is: 

1. Download monthly files from the ERA5 archive (`src/download.py`)

2. Regrid the raw data to the required resolutions (`src/regrid.py`)

The raw data is from the ERA5 reanalysis archive. Information on how to download the data can be found [here](https://confluence.ecmwf.int/display/CKB/How+to+download+ERA5) and [here](https://cds.climate.copernicus.eu/api-how-to).

Because downloading the data can take a long time (several weeks), the workflow is encoded using [Snakemake](https://snakemake.readthedocs.io/). See `Snakefile` and the configuration files for each variable in `scripts/config_{variable}.yml`. These files can be modified if additional variables are required. To execute Snakemake for a particular variable, type: `snakemake -p -j 4 all --configfile scripts/config_toa_incident_solar_radiation.yml`.
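When several variables are needed, the Snakemake call above can be generated per variable. The sketch below only builds and prints the commands; the helper `snakemake_command` is hypothetical, and it assumes that a `scripts/config_{variable}.yml` file exists for every variable in the list.

```python
import shlex


def snakemake_command(variable, jobs=4):
    """Build the Snakemake invocation for one variable as an argument list."""
    return [
        "snakemake", "-p", "-j", str(jobs), "all",
        "--configfile", f"scripts/config_{variable}.yml",
    ]


# Illustrative selection; a config file is assumed to exist for each entry.
variables = ["toa_incident_solar_radiation", "geopotential", "temperature"]

for variable in variables:
    command = snakemake_command(variable)
    print(shlex.join(command))
    # To actually run it from the repository root:
    # subprocess.run(command, check=True)
```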

 

In addition to the time-dependent fields, the constant fields were downloaded and processed using `scripts/download_and_regrid_constants.sh`.

 

### Downloading the TIGGE IFS baseline

To obtain the operational IFS baseline, we use the [TIGGE Archive](https://confluence.ecmwf.int/display/TIGGE). Downloading the data for Z500 and T850 is done in `scripts/download_tigge.py`; regridding is done in `scripts/convert_and_regrid_tigge.sh`.

### Regridding the T21 IFS baseline

The T21 baseline was created by Peter Dueben. The raw output can be found in the dataset. To regrid the data, `scripts/convert_and_regrid_IFS_TXX.sh` was used.

### Downloading and regridding CMIP historical climate model data

To download historical climate model data, use the Snakemake file in `snakemake_configs_CMIP`. Here, we downloaded data from the `MPI-ESM-HR` model. To download other models, search for the download links on the CMIP website and modify the scripts accordingly.

### Extracting single levels from 3D files

If you would like to extract a single level from 3D data, e.g. 850 hPa temperature, you can use `src/extract_level.py`. This could be useful to reduce the amount of data that needs to be loaded into RAM. An example usage would be: `python extract_level.py --input_fns DATADIR/5.625deg/temperature/*.nc --output_dir OUTDIR --level 850`
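The memory saving from extracting a single level can be illustrated with a small NumPy sketch. It does not reproduce `extract_level.py`; the `(time, level, lat, lon)` dimension order and the array sizes are assumptions made for the example.

```python
import numpy as np

# The 13 pressure levels (hPa) of the current release.
levels = np.array([50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000])

# Stand-in for a 3D temperature field: (time, level, lat, lon), assumed order.
data = np.zeros((24, len(levels), 32, 64), dtype="float32")

# Locate the 850 hPa level and keep only that slice.
idx = int(np.flatnonzero(levels == 850)[0])
t850 = data[:, idx]

print(t850.shape)
print("size reduction factor:", data.nbytes // t850.nbytes)
```

With 13 levels in the full field, keeping one level reduces the data to be loaded by the same factor.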