{"id":13547026,"url":"https://github.com/RolnickLab/climart","last_synced_at":"2025-04-02T19:32:17.030Z","repository":{"id":37712308,"uuid":"411840707","full_name":"RolnickLab/climart","owner":"RolnickLab","description":"A benchmark dataset for Machine Learning emulation of atmospheric radiative transfer in weather and climate models (NeurIPS 2021 Datasets and Benchmarks Track)","archived":false,"fork":false,"pushed_at":"2022-11-29T07:09:16.000Z","size":1135,"stargazers_count":40,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-03T15:38:25.846Z","etag":null,"topics":["atmospheric-science","climart","climate-change","dataset","distributional-shift","emulation","machine-learning","neural-networks","pytorch","radiative-transfer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RolnickLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-09-29T21:57:07.000Z","updated_at":"2024-09-11T15:08:58.000Z","dependencies_parsed_at":"2022-09-16T06:10:25.485Z","dependency_job_id":null,"html_url":"https://github.com/RolnickLab/climart","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RolnickLab%2Fclimart","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RolnickLab%2Fclimart/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RolnickLab%2Fclimart/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RolnickLa
b%2Fclimart/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RolnickLab","download_url":"https://codeload.github.com/RolnickLab/climart/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246880171,"owners_count":20848819,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["atmospheric-science","climart","climate-change","dataset","distributional-shift","emulation","machine-learning","neural-networks","pytorch","radiative-transfer"],"created_at":"2024-08-01T12:00:49.724Z","updated_at":"2025-04-02T19:32:15.980Z","avatar_url":"https://github.com/RolnickLab.png","language":"Python","funding_links":[],"categories":["Climate"],"sub_categories":[],"readme":"# ***ClimART*** - A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models\n\u003ca href=\"https://pytorch.org/get-started/locally/\"\u003e\u003cimg alt=\"Python\" src=\"https://img.shields.io/badge/-Python 3.7--3.9-blue?style=for-the-badge\u0026logo=python\u0026logoColor=white\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pytorch.org/get-started/locally/\"\u003e\u003cimg alt=\"PyTorch\" src=\"https://img.shields.io/badge/-PyTorch 1.8.1+-ee4c2c?style=for-the-badge\u0026logo=pytorch\u0026logoColor=white\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pytorchlightning.ai/\"\u003e\u003cimg alt=\"Lightning\" src=\"https://img.shields.io/badge/-Lightning-792ee5?style=for-the-badge\u0026logo=pytorchlightning\u0026logoColor=white\"\u003e\u003c/a\u003e\n\u003ca href=\"https://hydra.cc/\"\u003e\u003cimg alt=\"Config: 
hydra\" src=\"https://img.shields.io/badge/config-hydra-89b8cd?style=for-the-badge\u0026labelColor=gray\"\u003e\u003c/a\u003e\n![CC BY 4.0][cc-by-image]\n\n[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png\n[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg\n\n## Official PyTorch Implementation\n\n### Using deep learning to optimise radiative transfer calculations.\n\nOur NeurIPS 2021 Datasets Track paper: https://arxiv.org/abs/2111.14671\n\nAbstract:   *Numerical simulations of Earth's weather and climate require substantial amounts of computation. This has led to a growing interest in replacing subroutines that explicitly compute physical processes with approximate machine learning (ML) methods that are fast at inference time. Within weather and climate models, atmospheric radiative transfer (RT) calculations are especially expensive.  This has made them a popular target for neural network-based emulators. However, prior work is hard to compare due to the lack of a comprehensive dataset and standardized best practices for ML benchmarking. To fill this gap, we build a large dataset, ClimART, with more than **10 million** samples from present, pre-industrial, and future climate conditions, based on the Canadian Earth System Model.\nClimART poses several methodological challenges for the ML community, such as multiple out-of-distribution test sets, underlying domain physics, and a trade-off between accuracy and inference speed. We also present several novel baselines that indicate shortcomings of datasets and network architectures used in prior work.*\n\n**Contact:** Venkatesh Ramesh [(venka97 at gmail)](mailto:venka97@gmail.com) or Salva Rühling Cachay [(salvaruehling at gmail)](mailto:salvaruehling@gmail.com). 
\u003cbr\u003e\n\n## Overview:\n\n* ``climart/``: Package with the main code, baselines and ML training logic.\n* ``analysis/``: Scripts to create visualizations of the results (requires logging).\n* ``configs/``: YAML configuration files for Hydra that define (hyper-)parameters in a modular way.\n\n## Getting Started\n\u003cdetails\u003e\u003cp\u003e\n    \u003csummary\u003e\u003cb\u003e Requirements\u003c/b\u003e\u003c/summary\u003e\n    \u003cp style=\"padding: 10px; border: 2px solid red;\"\u003e\n    \u003cul\u003e\n    \u003cli\u003eLinux and Windows are supported, but we recommend Linux for performance and compatibility reasons.\u003c/li\u003e\n    \u003cli\u003eAn NVIDIA GPU with at least 8 GB of memory and a system with 12 GB of RAM (more RAM is required when training with the --load_train_into_mem option, which allows for faster training). We have done all testing and development using NVIDIA V100 GPUs.\u003c/li\u003e \n    \u003cli\u003e64-bit Python \u003e=3.7 and PyTorch \u003e=1.8.1. See https://pytorch.org/ for PyTorch install instructions.\u003c/li\u003e \n    \u003cli\u003eThe Python libraries listed in the ``env.yml`` file, see Getting Started (requires miniconda/conda to be installed).\u003c/li\u003e \n    \u003c/ul\u003e\u003c/p\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\u003cp\u003e\n    \u003csummary\u003e\u003cb\u003e Downloading the ClimART Dataset \u003c/b\u003e\u003c/summary\u003e\n    \u003cp style=\"padding: 10px; border: 2px solid #ff0000;\"\u003e\n    By default, only a subset of ClimART is downloaded.\n    To download the train/val/test years you want, please change the loop in ``data_download.sh`` appropriately.\n    To download the whole ClimART dataset, you can simply run:\n    \n    sudo bash download_climart.sh \n   \u003c/p\u003e\n\u003c/details\u003e\n       \n  **Note:** If you have issues with downloading the data, please let us know so that we can help.\n\n    conda env create -f env.yml   # create a new environment with all dependencies\n    conda 
activate climart  # activate the environment called 'climart'\n    sudo bash download_data_subset.sh  # download the dataset (or a subset of it, see above)\n    python run.py trainer.gpus=0 datamodule.train_years=\"2000\" # train an MLP emulator on the year 2000 (trainer.gpus=0 runs on the CPU)\n\n## Data Structure\n\nTo avoid storage redundancy, we store a single input array for both pristine- and clear-sky conditions. The dimensions of ClimART’s input arrays are:\n\u003cul\u003e\n\u003cli\u003elayers: (N, 49, Dlay) \u003c/li\u003e\n\u003cli\u003elevels: (N, 50, 4) \u003c/li\u003e\n\u003cli\u003eglobals: (N, 82) \u003c/li\u003e\n\u003c/ul\u003e\n\nwhere N is the data dimension (i.e. the number of examples of a specific year, or, during training, of a batch),\n 49 and 50 are the numbers of layers and levels in a column, respectively. Dlay, 4, and 82 are the numbers of features/channels for layers, levels, and globals, respectively. \n\nFor pristine-sky conditions Dlay = 14, while for clear-sky conditions Dlay = 45, since the latter includes extra aerosol-related variables. The array for pristine-sky conditions can be easily accessed by slicing the first 14 features out of the stored array, e.g.:\n```      pristine_array = layers_array[:, :, :14] ```. 
This is automatically done for you when you set the atmospheric\ncondition type via ```datamodule.exp_type=pristine``` or ```datamodule.exp_type=clear_sky```.\n\n\n## Baselines\n\nTo reproduce our paper results (for seed = 7), you may choose any of our pre-defined configs in the\n [configs/model](configs/model) folder and train it as follows:\n \n ```\n# You can replace mlp with \"graphnet\", \"gcn\", or \"cnn\" to run a different ML model\n# To train on the CPU, choose trainer.gpus=0\n# Specify the directory where the ClimART data is saved with datamodule.data_dir=\"\u003cyour-data-dir\u003e\"\n# Test on the OOD subsets by setting the arg datamodule.{test_ood_historic, test_ood_1991, test_ood_future}=True\npython run.py seed=7 model=mlp trainer.gpus=1 \n```\n\nTo reproduce the exact CNN model used in the paper, you can use the following command:\n```\npython run.py experiment=reproduce_paper2021_cnn seed=7    # feel free to run for more/other seeds\n```\nNote: You can also take a look at \n[this WandB report](https://wandb.ai/salv47/ClimART-public-runs/reports/ClimART-paper-CNN-runs--VmlldzozMDUyOTUy)\nwhich shows the results of three runs of the CNN model from the paper.\n\n### Inference\nCheck out [this notebook](notebooks/2022-06-06-get-predictions-pl.ipynb) for simple code on how to extract the predictions\nfor each target variable from a trained model (for arbitrary years of the ClimART dataset).\n\n## Tips\n\n\u003cdetails\u003e\u003cp\u003e\n    \u003csummary\u003e\u003cb\u003e Reproducibility \u0026 Data Generation code \u003c/b\u003e\u003c/summary\u003e\n    \u003cp style=\"padding: 10px; border: 2px solid #ff0000;\"\u003e\n    To best reproduce our baselines and experiments and/or look into how the ClimART dataset was created/designed,\n    have a look at our `research_code` branch. 
It operates on pure PyTorch and has a less clean interface/code \n    than our main branch -- if you have any questions, let us know!\n\u003c/p\u003e\u003c/details\u003e\n\n\u003cdetails\u003e\u003cp\u003e\n    \u003csummary\u003e\u003cb\u003e Testing on OOD data subsets \u003c/b\u003e\u003c/summary\u003e\n    \u003cp style=\"padding: 10px; border: 2px solid #ff0000;\"\u003e\n    By default, tests run on the main test dataset only (2007-14). To test on the \n    historic, future or anomaly test subsets, you need to pass/change the arg\n    \u003ccode\u003edatamodule.test_ood_historic=True\u003c/code\u003e (and/or \u003ccode\u003etest_ood_future=True\u003c/code\u003e, \u003ccode\u003etest_ood_1991=True\u003c/code\u003e),\n     in addition to downloading those data files, e.g. via the \u003ccode\u003edownload_climart.sh\u003c/code\u003e script.\n\n\u003c/p\u003e\u003c/details\u003e\n\n\u003cdetails\u003e\u003cp\u003e\n    \u003csummary\u003e\u003cb\u003e Overriding nested Hydra config groups \u003c/b\u003e\u003c/summary\u003e\n    \u003cp style=\"padding: 10px; border: 2px solid #ff0000;\"\u003e\n    Nested config groups must be overridden with a different notation, not with a dot, since the value would otherwise be interpreted as a string.\n    For example, if you want to change the optimizer in the model you want to train, you should run:\n    \u003ccode\u003epython run.py  model=graphnet  optimizer@model.optimizer=SGD\u003c/code\u003e\n    \u003cbr\u003e\n\u003c/p\u003e\u003c/details\u003e\n\n\u003cdetails\u003e\u003cp\u003e\n    \u003csummary\u003e\u003cb\u003e Local configurations \u003c/b\u003e\u003c/summary\u003e\n    \u003cp style=\"padding: 10px; border: 2px solid #ff0000;\"\u003e\n    You can easily use a local config file (that, e.g., overrides data paths, the working dir, etc.) by putting such a YAML config\n    in the configs/local subdirectory (by default, Hydra searches for \u0026 uses the file configs/local/default.yaml, if it 
exists)\n\u003c/p\u003e\u003c/details\u003e   \n    \n\u003cdetails\u003e\u003cp\u003e\n    \u003csummary\u003e\u003cb\u003e Wandb \u003c/b\u003e\u003c/summary\u003e\n    \u003cp style=\"padding: 10px; border: 2px solid #ff0000;\"\u003e\n    If you use Wandb, make sure to select the \"Group first prefix\" option in the panel settings of the web app.\n    This will make it easier to browse through the logged metrics.\n\u003c/p\u003e\u003c/details\u003e\n\n\u003cdetails\u003e\u003cp\u003e\n    \u003csummary\u003e\u003cb\u003e Credits \u0026 Resources \u003c/b\u003e\u003c/summary\u003e\n    \u003cp style=\"padding: 10px; border: 2px solid #ff0000;\"\u003e\n    The following template was extremely useful for getting started with the PL+Hydra implementation:\n    [ashleve/lightning-hydra-template](https://github.com/ashleve/lightning-hydra-template)\n\u003c/p\u003e\u003c/details\u003e\n\n\n\n## License: \nThis work is made available under [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/legalcode) license. ![CC BY 4.0][cc-by-shield]\n\n## Development\n\nThis repository is currently under active development and you may encounter bugs with some functionality. \nAny feedback, extensions \u0026 suggestions are welcome!\n\n\n## Citation\nIf you find ClimART or this repository helpful, feel free to cite our publication:\n\n    @inproceedings{cachay2021climart,\n        title={{ClimART}: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models},\n        author={Salva R{\\\"u}hling Cachay and Venkatesh Ramesh and Jason N. S. 
Cole and Howard Barker and David Rolnick},\n        booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},\n        year={2021},\n        url={https://arxiv.org/abs/2111.14671}\n    }","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRolnickLab%2Fclimart","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FRolnickLab%2Fclimart","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRolnickLab%2Fclimart/lists"}