![Logo](https://github.com/ai4environment/WeatherBench/blob/master/figures/logo_text_left.png?raw=true)
# WeatherBench: A benchmark dataset for data-driven weather forecasting

**🚨🚨🚨 [WeatherBench 2](https://github.com/google-research/weatherbench2) has been released. It provides an updated and much improved benchmark including more comprehensive and more easily accessible datasets.🚨🚨🚨**

[![Binder](https://binder.pangeo.io/badge_logo.svg)](https://binder.pangeo.io/v2/gh/pangeo-data/WeatherBench/master?filepath=quickstart.ipynb)

If you use this dataset, please cite:
> Stephan Rasp, Peter D. Dueben, Sebastian Scher, Jonathan A. Weyn, Soukayna Mouatadid, and Nils Thuerey, 2020.
> WeatherBench: A benchmark dataset for data-driven weather forecasting.
> arXiv: [https://arxiv.org/abs/2002.00469](https://arxiv.org/abs/2002.00469)

This repository contains all the code for downloading and processing the data, as well as the code for the baseline models
in the paper.

---
*Note: the data has changed since the original release. Here is a list of changes:*
- *New vertical levels. The levels used to be [1, 10, 100, 200, 300, 400, 500, 600, 700, 850, 1000] and are now [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]. This makes them compatible with CMIP output. The new levels include all of the old ones with the exception of [1, 10].*
- *CMIP data. Regridded CMIP data for some variables was added. This is the historical simulation of the MPI-ESM-HR model.*
---

If you have any questions about this dataset, please use the [GitHub Issues](https://github.com/pangeo-data/WeatherBench/issues) feature of this repository!

## Leaderboard
| Model | Z500 RMSE (3 / 5 days) [m²/s²] | T850 RMSE (3 / 5 days) [K] | Notes | Reference |
|--------------------|----------------------------------|----------------------------|----------------------|------------------|
| Operational IFS | 154 / 334 | 1.36 / 2.03 | ECMWF physical model (10 km) | [Rasp et al. 2020](https://arxiv.org/abs/2002.00469) |
| Rasp and Thuerey 2020 (direct/continuous) | **268 / 499** | **1.65 / 2.41** | Resnet with CMIP pretraining (5.625 deg) | [Rasp and Thuerey 2020](http://arxiv.org/abs/2008.08626) |
| IFS T63 | 268 / 463 | 1.85 / 2.52 | Lower resolution physical model (approx. 1.9 deg) | [Rasp et al. 2020](https://arxiv.org/abs/2002.00469) |
| Weyn et al. 2020 (iterative) | **373 / 611** | **1.98 / 2.87** | UNet with cube-sphere mapping (2 deg) | [Weyn et al. 2020](https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2020MS002109) |
| Clare et al. 2021 (direct) | 375 / 627 | 2.11 / 2.91 | Stacked ResNets with probabilistic output (5.625 deg) | [Clare et al. 2021](https://rmets.onlinelibrary.wiley.com/doi/full/10.1002/qj.4180) |
| IFS T42 | 489 / 743 | 3.09 / 3.83 | Lower resolution physical model (approx. 2.8 deg) | [Rasp et al. 2020](https://arxiv.org/abs/2002.00469) |
| Weekly climatology | 816 | 3.50 | Climatology for each calendar week | [Rasp et al. 2020](https://arxiv.org/abs/2002.00469) |
| Persistence | 936 / 1033 | 4.23 / 4.56 | | [Rasp et al. 2020](https://arxiv.org/abs/2002.00469) |
| Climatology | 1075 | 5.51 | | [Rasp et al. 2020](https://arxiv.org/abs/2002.00469) |

## Quick start
You can follow the quickstart guide in [this notebook](https://github.com/pangeo-data/WeatherBench/blob/master/quickstart.ipynb) or launch it directly from [Binder](https://binder.pangeo.io/v2/gh/pangeo-data/WeatherBench/master?filepath=quickstart.ipynb).

## Download the data
The data is hosted [here](https://mediatum.ub.tum.de/1524895) with the following directory structure:

```
.
|-- 1.40625deg
|   |-- 10m_u_component_of_wind
|   |-- 10m_v_component_of_wind
|   |-- 2m_temperature
|   |-- constants
|   |-- geopotential
|   |-- old
|   |   `-- temperature
|   |-- potential_vorticity
|   |-- relative_humidity
|   |-- specific_humidity
|   |-- temperature
|   |-- toa_incident_solar_radiation
|   |-- total_cloud_cover
|   |-- total_precipitation
|   |-- u_component_of_wind
|   |-- v_component_of_wind
|   `-- vorticity
|-- 2.8125deg
|   |-- 10m_u_component_of_wind
|   |-- 10m_v_component_of_wind
|   |-- 2m_temperature
|   |-- constants
|   |-- geopotential
|   |-- potential_vorticity
|   |-- relative_humidity
|   |-- specific_humidity
|   |-- temperature
|   |-- toa_incident_solar_radiation
|   |-- total_cloud_cover
|   |-- total_precipitation
|   |-- u_component_of_wind
|   |-- v_component_of_wind
|   `-- vorticity
|-- 5.625deg
|   |-- 10m_u_component_of_wind
|   |-- 10m_v_component_of_wind
|   |-- 2m_temperature
|   |-- constants
|   |-- geopotential
|   |-- geopotential_500
|   |-- potential_vorticity
|   |-- relative_humidity
|   |-- specific_humidity
|   |-- temperature
|   |-- temperature_850
|   |-- toa_incident_solar_radiation
|   |-- total_cloud_cover
|   |-- total_precipitation
|   |-- u_component_of_wind
|   |-- v_component_of_wind
|   `-- vorticity
|-- baselines
|   `-- saved_models
|-- CMIP
|   `-- MPI-ESM
|       |-- 2.8125deg
|       |   |-- geopotential
|       |   |-- specific_humidity
|       |   |-- temperature
|       |   |-- u_component_of_wind
|       |   `-- v_component_of_wind
|       `-- 5.625deg
|           |-- geopotential
|           |-- specific_humidity
|           |-- temperature
|           |-- u_component_of_wind
|           `-- v_component_of_wind
|-- IFS_T42
|   `-- raw
|-- IFS_T63
|   `-- raw
`-- tigge
    |-- 1.40625deg
    |   |-- geopotential_500
    |   `-- temperature_850
    |-- 2.8125deg
    |   |-- geopotential_500
    |   `-- temperature_850
    `-- 5.625deg
        |-- 2m_temperature
        |-- geopotential_500
        |-- temperature_850
        `-- total_precipitation
```

To get started, download either the entire 5.625 degree dataset (175 GB) using
```shell
wget "https://dataserv.ub.tum.de/s/m1524895/download?path=%2F5.625deg&files=all_5.625deg.zip" -O all_5.625deg.zip
```
or simply the single level (500 hPa) geopotential data using
```shell
wget "https://dataserv.ub.tum.de/s/m1524895/download?path=%2F5.625deg%2Fgeopotential_500&files=geopotential_500_5.625deg.zip" -O geopotential_500_5.625deg.zip
```
and then unzip the files using `unzip <file>.zip`. You can also use `ftp` or `rsync` to download the data. For instructions, follow the [download link](https://mediatum.ub.tum.de/1524895).
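
The two `wget` commands above follow the same URL pattern. As a minimal sketch, assuming that other resolution and variable directories follow that pattern too (this has only been checked against the two commands above), the URL and archive name could be built like this. `build_download_url` is a hypothetical helper, not part of this repository.

```python
from urllib.parse import quote

# Base URL taken from the wget commands above.
BASE_URL = "https://dataserv.ub.tum.de/s/m1524895/download"


def build_download_url(resolution, variable=None):
    """Return (url, zip_name) for one variable, or for a whole resolution.

    Hypothetical helper: the naming pattern is inferred from the two
    documented commands and is an assumption for anything else.
    """
    if variable is None:
        path = f"/{resolution}"
        zip_name = f"all_{resolution}.zip"
    else:
        path = f"/{resolution}/{variable}"
        zip_name = f"{variable}_{resolution}.zip"
    # Percent-encode the path, including the slashes, as in the wget commands.
    url = f"{BASE_URL}?path={quote(path, safe='')}&files={zip_name}"
    return url, zip_name


url, zip_name = build_download_url("5.625deg", "geopotential_500")
print(url)
print(zip_name)
```

The returned URL can be passed to `wget` or to any HTTP client, with `zip_name` as the output file name.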

## Baselines and evaluation
**IMPORTANT:** The format of the predictions file is a
NetCDF dataset with dimensions `[init_time, lead_time, lat, lon]`. Consult the notebooks for examples. You are
strongly encouraged to format your predictions in the same way and then use the same evaluation functions to ensure
consistent evaluation.
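
To illustrate the `[init_time, lead_time, lat, lon]` layout, here is a minimal NumPy sketch. The initialization times, the 3- and 5-day lead times, and the 32 x 64 grid are example values chosen for illustration. The sketch only shows the axis order and does not write a NetCDF file; see the notebooks for that.

```python
import numpy as np

# Assumed example values: two initialization times, 3- and 5-day lead times,
# and the 32 x 64 grid implied by a global 5.625 degree resolution.
init_time = np.array(["2017-01-01T00", "2017-01-01T12"], dtype="datetime64[h]")
lead_time = np.array([72, 120], dtype="timedelta64[h]")
lat = np.arange(-90 + 5.625 / 2, 90, 5.625)  # 32 cell centers
lon = np.arange(0, 360, 5.625)               # 64 cell centers

# Placeholder forecast values in [init_time, lead_time, lat, lon] order.
predictions = np.zeros((len(init_time), len(lead_time), len(lat), len(lon)))

# A forecast is valid at its initialization time plus its lead time.
valid_time = init_time[:, None] + lead_time[None, :]

print(predictions.shape)
print(valid_time)
```

Keeping `init_time` and `lead_time` as separate axes means every forecast can be matched to the observation at `init_time + lead_time` during evaluation.
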
### Baselines
The baselines are created using Jupyter notebooks in `notebooks/`. In all notebooks, the forecasts are saved as a
NetCDF file in the `predictions` directory of the dataset.

### CNN baselines
An example of how to load the data and train a CNN using Keras is given in `notebooks/3-cnn-example.ipynb`. In
addition, a command-line script for training CNNs is provided in `src/train_nn.py`. For the baseline CNNs in the
paper, the config files are given in `src/nn_configs/`. To reproduce the results in the paper, run, for example, `python -m src.train_nn -c src/nn_configs/fccnn_3d.yml`.

### Evaluation
Evaluation and comparison of the different baselines is done in `notebooks/4-evaluation.ipynb`. The scoring is done
using the functions in `src/score.py`. The RMSE values for the baseline models are also saved in the `predictions`
directory of the dataset. This is useful for plotting your own models alongside the baselines.
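
For orientation, the sketch below computes an RMSE weighted by the cosine of latitude, so that the smaller grid cells near the poles count less. The weighting scheme and the `[time, lat, lon]` axis order are assumptions of this sketch. It is not the code in `src/score.py`, which you should use for any results you want to compare with the leaderboard.

```python
import numpy as np


def weighted_rmse(forecast, truth, lat_deg):
    """Latitude-weighted RMSE for arrays shaped [time, lat, lon].

    Hypothetical helper for illustration only.
    """
    # Weight each latitude row by cos(latitude), normalized to average one.
    weights = np.cos(np.deg2rad(lat_deg))
    weights = weights / weights.mean()
    squared_error = (forecast - truth) ** 2
    # Broadcast the weights along the lat axis, then average over all axes.
    return float(np.sqrt((squared_error * weights[None, :, None]).mean()))


lat = np.arange(-90 + 5.625 / 2, 90, 5.625)
truth = np.zeros((4, lat.size, 64))
forecast = truth + 2.0  # constant error of 2 at every grid point
print(weighted_rmse(forecast, truth, lat))
```

Because the weights average to one, a constant error at every grid point gives an RMSE equal to that error, which is a quick sanity check for any implementation.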

## Data processing
The dataset already contains the most important processed data. If you would like to download a different variable,
regrid to a different resolution or extract single levels from the 3D files, here is how to do that!

### Downloading and processing the raw data from the ERA5 archive

The workflow to get to the processed data that ended up in the data repository above is:
1. Download monthly files from the ERA5 archive (`src/download.py`)
2. Regrid the raw data to the required resolutions (`src/regrid.py`)

The raw data is from the ERA5 reanalysis archive. Information on how to download the data can be found
[here](https://confluence.ecmwf.int/display/CKB/How+to+download+ERA5) and
[here](https://cds.climate.copernicus.eu/api-how-to).

Because downloading the data can take a long time (several weeks), the workflow is encoded using [Snakemake](https://snakemake.readthedocs.io/). See `Snakefile` and the configuration files for each variable in `scripts/config_{variable}.yml`. These
files can be modified if additional variables are required. To execute Snakemake for a particular variable, run
`snakemake -p -j 4 all --configfile scripts/config_toa_incident_solar_radiation.yml`.
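
If you need to process several variables, the Snakemake invocation above can be generated per variable. This is a minimal sketch: `snakemake_command` is a hypothetical helper, and it assumes a config file `scripts/config_{variable}.yml` exists for each variable you pass in (only the `toa_incident_solar_radiation` file is named in this README).

```python
import shlex


def snakemake_command(variable, jobs=4):
    """Build the Snakemake invocation for one variable's config file."""
    return [
        "snakemake", "-p", "-j", str(jobs), "all",
        "--configfile", f"scripts/config_{variable}.yml",
    ]


# "geopotential" is an assumed example; check that its config file exists.
for variable in ["toa_incident_solar_radiation", "geopotential"]:
    print(shlex.join(snakemake_command(variable)))
    # To run it instead of printing it:
    # subprocess.run(snakemake_command(variable), check=True)
```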

In addition to the time-dependent fields, the constant fields were downloaded and processed using
`scripts/download_and_regrid_constants.sh`.

### Downloading the TIGGE IFS baseline

To obtain the operational IFS baseline, we use the [TIGGE Archive](https://confluence.ecmwf.int/display/TIGGE).
Downloading the data for Z500 and T850 is done in `scripts/download_tigge.py`; regridding is done in
`scripts/convert_and_regrid_tigge.sh`.

### Regridding the T21 IFS baseline

The T21 baseline was created by Peter Dueben. The raw output can be found in the dataset. To regrid the data,
`scripts/convert_and_regrid_IFS_TXX.sh` was used.

### Downloading and regridding CMIP historical climate model data

To download historical climate model data, use the Snakemake file in `snakemake_configs_CMIP`. Here, we downloaded data from the `MPI-ESM-HR` model. To download other models, search for the download links on the CMIP website and modify the scripts accordingly.

### Extracting single levels from 3D files

If you would like to extract a single level from 3D data, e.g. 850 hPa temperature, you can use
`src/extract_level.py`. This could be useful to reduce the amount of data that needs to be loaded into RAM. An example
usage would be: `python extract_level.py --input_fns DATADIR/5.625deg/temperature/*.nc --output_dir OUTDIR --level 850`
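
Conceptually, extracting a level means selecting one index along the vertical axis. The NumPy sketch below shows the idea using the pressure levels listed in the note near the top of this README. The `[time, level, lat, lon]` axis order and the `extract_level` function are assumptions of this sketch, not the implementation in `src/extract_level.py`.

```python
import numpy as np

# Pressure levels in hPa, as listed in the note near the top of this README.
LEVELS = [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]


def extract_level(data, level, levels=LEVELS):
    """Select one pressure level from an array shaped [time, level, lat, lon]."""
    return data[:, levels.index(level), :, :]


temperature = np.random.rand(8, len(LEVELS), 32, 64)  # placeholder 3D field
t850 = extract_level(temperature, 850)
print(t850.shape)  # (8, 32, 64)
```

The extracted field is 13 times smaller than the full 3D field here, which is the memory saving mentioned above.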