{"id":34922821,"url":"https://github.com/peterkuma/ml-clouds-2021","last_synced_at":"2026-04-15T14:38:56.554Z","repository":{"id":48328274,"uuid":"447961092","full_name":"peterkuma/ml-clouds-2021","owner":"peterkuma","description":"Code for the paper \"Machine learning of cloud types in satellite observations and climate models\".","archived":false,"fork":false,"pushed_at":"2023-03-06T19:32:34.000Z","size":339313,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-26T02:34:30.521Z","etag":null,"topics":["climate-model","climate-science","climate-sensitivity","clouds","machine-learning","tensorflow"],"latest_commit_sha":null,"homepage":"https://peterkuma.net/science/papers/kuma_et_al_2022a/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/peterkuma.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-01-14T12:31:59.000Z","updated_at":"2024-04-20T16:16:54.000Z","dependencies_parsed_at":"2023-02-06T09:46:35.487Z","dependency_job_id":null,"html_url":"https://github.com/peterkuma/ml-clouds-2021","commit_stats":{"total_commits":264,"total_committers":1,"mean_commits":264.0,"dds":0.0,"last_synced_commit":"5c513b0b3333fb7ce0569285de5df1b20fc9bcea"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/peterkuma/ml-clouds-2021","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterkuma%2Fml-clouds-2021","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterkuma%2Fml-clouds-2021/tags","releases_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/peterkuma%2Fml-clouds-2021/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterkuma%2Fml-clouds-2021/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/peterkuma","download_url":"https://codeload.github.com/peterkuma/ml-clouds-2021/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterkuma%2Fml-clouds-2021/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31846417,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-15T13:28:40.153Z","status":"ssl_error","status_checked_at":"2026-04-15T13:28:29.396Z","response_time":63,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["climate-model","climate-science","climate-sensitivity","clouds","machine-learning","tensorflow"],"created_at":"2025-12-26T13:56:56.225Z","updated_at":"2026-04-15T14:38:56.536Z","avatar_url":"https://github.com/peterkuma.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Code for the paper \"Machine learning of cloud types in satellite observations and climate models\"\n\nPeter Kuma\u003csup\u003e1\u003c/sup\u003e,\nFrida A.-M. Bender\u003csup\u003e1\u003c/sup\u003e,\nAlex Schuddeboom\u003csup\u003e2\u003c/sup\u003e,\nAdrian J. 
McDonald\u003csup\u003e2\u003c/sup\u003e,\nØyvind Seland\u003csup\u003e3\u003c/sup\u003e\n\n\u003csup\u003e1\u003c/sup\u003eDepartment of Meteorology (MISU), Stockholm University, Stockholm, Sweden\\\n\u003csup\u003e2\u003c/sup\u003eSchool of Physical and Chemical Sciences, Christchurch, Aotearoa New Zealand\\\n\u003csup\u003e3\u003c/sup\u003eNorwegian Meteorological Institute, Oslo, Norway\n\nThis repository contains code for the paper\n[Machine learning of cloud types in satellite observations and climate\nmodels](https://peterkuma.net/science/papers/kuma_et_al_2022a/).\n\n## Introduction\n\nThe code contains scripts for training an artificial neural network (ANN) for\nprediction of cloud types based on satellite and ground-based station data, and\nrelated data processing and visualisation. The artificial neural network is\nimplemented in [TensorFlow](https://www.tensorflow.org/). The scripts use\nPython and bash. They should be run on Linux, although it may be possible to\nadapt them to other operating systems.\n\nThe scripts are divided into data processing scripts and plotting scripts.\nThe scripts are dependent on one another for data input and output. The script\n`run` in the main directory runs the individual data processing and plotting\nscripts located in the directory `bin` and takes care of the dependencies.\n\nStorage requirements for running the code on all available climate models and\nreanalyses are on the scale of 10 TB, and main memory requirements are on the\nscale of 60 GB. Lower hardware requirements are possible if used with fewer\nmodels or shorter training data time periods.\n\nPlease see the manuscript for more details about the ANN. If you have any\nquestions about the code or would like to report a bug, you can contact the\nmanuscript authors or submit an [issue on\nGitHub](https://github.com/peterkuma/ml-clouds-2021/issues). 
Contributions\nare welcome through [pull requests on\nGitHub](https://github.com/peterkuma/ml-clouds-2021/pulls).\n\n## Requirements\n\nThe code can be run on a Linux distribution with the following software\n(exact versions are listed for reproducibility, but newer versions may work\nequally well):\n\n- Python (3.9.2)\n- Cython (0.29.2)\n- aria2 (1.35.0)\n- GNU parallel (20161222)\n- cdo (1.9.10)\n\nas well as Python packages listed in `requirements.txt`.\n\nOn Debian-based Linux distributions (Ubuntu, Debian, Devuan, ...), the required\nsoftware can be installed with:\n\n```sh\napt install python3 cython3 aria2 parallel cdo\n```\n\nOptionally, to install the Python packages in a virtual environment (venv)\ninstead of the user's home directory:\n\n```sh\npython3 -m venv venv\n. venv/bin/activate\n```\n\nTo install the Python packages:\n\n```sh\npip3 install -r requirements.txt\n```\n\nDepending on the Linux distribution, python and pip might be available as\n`python` or `python3`, and `pip` or `pip3`.\n\n## Input datasets\n\nThe input datasets are not contained in this repository because of their large\nsize, except for surface temperature data and a table of model ECS, TCR and\ncloud feedback. The rest of the datasets have to be downloaded from other\nrepositories as described below. It is possible to run the scripts with fewer\nclimate models and reanalyses and thus reduce the amount of data needed to\nbe downloaded. The `run` script expects a particular subdirectory structure of\nthe `input` directory, described [below](#input-directory).\n\n### CERES\n\nSYN1deg Level 3 daily means can be downloaded from the [CERES\nwebsite](https://ceres.larc.nasa.gov). 
They have to be converted to NetCDF with\nthe tool\n[h4toh5](https://portal.hdfgroup.org/display/support/Download+h4h5tools)\n(provided by the HDF Group), and stored in `input/ceres`.\n\nAfter downloading, the data should be resampled to 2.5° resolution with\n[cdo](https://code.mpimet.mpg.de/projects/cdo/embedded/index.html):\n\n```sh\n# To be run in the directory with the CERES NetCDF files.\nmkdir 2.5deg\nparallel cdo -remapcon,r144x96 {} 2.5deg/{} ::: *.nc\n```\n\n### Historical Unidata Internet Data Distribution (IDD) Global Observational Data\n\nThe IDD dataset contains ship and buoy records from the Global Telecommunication\nSystem. It can be downloaded from [Research Data\nArchive](https://rda.ucar.edu/datasets/ds336.0/). The relevant files are\nthe SYNOP and BUOY NetCDF files (2008–present), and the HISTSURFACEOBS tar\nfiles (2003–2008). The HISTSURFACEOBS files have to be unpacked after\ndownloading.\n\nIn the examples below it is assumed that the IDD NetCDF files are stored under\n`input/idd/synop` and `input/idd/buoy` for the synop and buoy files,\nrespectively.\n\n### Climate Model Intercomparison Project (CMIP)\n\nCMIP5 and CMIP6 model output can be downloaded from the\n[CMIP5](https://esgf-node.llnl.gov/projects/cmip5/) and\n[CMIP6](https://esgf-node.llnl.gov/projects/cmip6/) data archives.  The\nrelevant experiments are `historical` (`hist-1950` in the case of EC-Earth3P)\nand `abrupt-4xCO2`. The required variables are `tas` in the monthly (`mon`)\nfrequency, and `rlut`, `rlutcs`, `rsdt`, `rsut`, `rsutcs` in the daily (`day`)\nfrequency.\n\nThe command `download_cmip` can be used to create a list of CMIP files\nto download from a JSON catalog file, which can be created on the archive site\nabove (`return results as JSON` on the search page). `limit=` in the URL to the\nJSON file should be changed to 10000, and `Show All Replicas` should be\nselected when searching. 
The resulting file list can be used with the program\naria2c as `aria2c -i \u003cfile\u003e` to download the files. Afterwards, use the\ncommands `create_by_model` and `create_by_var` to create an index of symlinks\nin the directory where the downloaded files are stored. This index is required\nby the `run` script.\n\nIn the examples below it is assumed that the CMIP5 and CMIP6 files are stored\nin `input/cmip5/\u003cexperiment\u003e/\u003cfrequency\u003e/` and\n`input/cmip6/\u003cexperiment\u003e/\u003cfrequency\u003e/`, respectively, where `\u003cexperiment\u003e`\nis either `historical` (for both `historical` and `hist-1950`) or\n`abrupt-4xCO2` and `\u003cfrequency\u003e` is `day` or `mon`.\n\nThe data should be resampled to 2.5° resolution in the same way as the CERES\ndata.\n\n### GISS Surface Temperature Analysis (GISTEMP)\n\nThe GISTEMP dataset is available in `data/gistemp` as the original file\n(CSV) and converted to NetCDF with `gistemp_to_nc` (required by the main\ncommands). The original dataset was downloaded from [NASA\nGISS](https://data.giss.nasa.gov/gistemp/). The original terms of use apply.\n\n### ERA5\n\nERA5 hourly data on pressure levels from 1979 to present can be downloaded\nfrom the [Copernicus\nwebsite](https://cds.climate.copernicus.eu/#!/search?text=ERA5\u0026type=dataset).\nThey have to be converted to daily mean files with cdo and stored in\n`input/era5`.\n\nThe data should be resampled to 2.5° resolution in the same way as the CERES\ndata.\n\n### MERRA-2\n\nThe M2T1NXRAD MERRA-2 product can be downloaded from [NASA\nEarthData](https://disc.gsfc.nasa.gov/datasets?project=MERRA-2). Daily means\ncan be downloaded with the GES DISC Subsetter. 
They have to be stored in\n`input/merra-2`.\n\nThe data should be resampled to 2.5° resolution in the same way as the CERES\ndata.\n\n### Global mean near-surface temperature\n\nGlobal mean near-surface temperature datasets should be stored in `input/tas`.\nThey are present in this repository and need to be extracted from the\narchive `input/tas.tar.xz`.\n\n## Directories\n\n### Input directory\n\nThe `input` directory should contain the necessary input files. Apart from the\ndatasets already contained in this repository, the files need to be downloaded\nfrom the sources as described above. Models and reanalyses which are not\navailable should be removed from the `input/models_*` files before running\n`run`. Below is a description of the structure of the input directory\n(directories are marked with `/` at the end of the name).\n\n```\nceres/              CERES SYN1deg daily mean files (NetCDF).\n↳ 2.5deg/           The same as above, but resampled to 2.5°.\ncmip5/              CMIP5 data files (NetCDF).\n↳ abrupt-4xCO2/     abrupt-4xCO2 experiment files.\n  ↳ day/            Daily mean files for the variables rlut, rlutcs, rsdt, rsut and rsutcs.\n    ↳ 2.5deg/       The same as above, but resampled to 2.5°.\n      ↳ by-model/   Directory created by create_by_model.\n  ↳ mon/            Monthly mean files for the variable tas.\ncmip6/              CMIP6 data files (NetCDF).\n↳ abrupt-4xCO2/     abrupt-4xCO2 experiment files.\n  ↳ day/            Daily mean files for the variables rlut, rlutcs, rsdt, rsut and rsutcs.\n    ↳ 2.5deg/       The same as above, but resampled to 2.5°.\n      ↳ by-model/   Directory created by create_by_model.\n  ↳ mon/            Monthly mean files for the variable tas.\n↳ hist-1950/        hist-1950 experiment files for the EC-Earth3P model.\n  ↳ day/            Daily mean files for the variables rlut, rlutcs, rsdt, rsut and rsutcs.\n    ↳ 2.5deg/       The same as above, but resampled to 2.5°.\n      ↳ by-model/   Directory created by 
create_by_model.\n  ↳ mon/            Monthly mean files for tas.\n↳ historical/       historical experiment files.\n  ↳ day/            Daily mean files for the variables rlut, rlutcs, rsdt, rsut and rsutcs.\n    ↳ 2.5deg/       The same as above, but resampled to 2.5°.\n      ↳ by-model/   Directory created by create_by_model.\n  ↳ mon/            Monthly mean files for the variable tas.\necs/\n↳ ecs.csv           ECS, TCR and CLD values for the CMIP5 and CMIP6 models.\nera5/               Daily mean ERA5 NetCDF files with the following variables in each file: tisr, tsr, tsrc, ttr and ttrc.\n↳ 2.5deg/           The same as above, but resampled to 2.5°.\nidd/\n↳ buoy/             IDD buoy files (NetCDF).\n↳ synop/            IDD synop files (NetCDF).\nlandmask/\n↳ ne_110m_land.nc   Land-sea mask derived from Natural Earth data.\nmerra2/             Daily mean MERRA-2 NetCDF files of the M2T1NXRAD product with the following variables in each file: LWTUP, LWTUPCLR, SWTDN, SWTNT and SWTNTCLR.\n↳ 2.5deg/           The same as above, but resampled to 2.5°.\nnoresm2/            NorESM2 model files.\n↳ historical/\n  ↳ day/            Daily mean files.\n    ↳ \u003cvariable\u003e/   Daily mean NorESM NetCDF files in the historical experiment, where variable is FLNT, FLNTC, FLUT, FLUTC, FSNTOA, FSNTOAC and SOLIN.\n    ↳ 2.5deg/       The same as above, but resampled to 2.5°.\n      ↳ \u003cvariable\u003e/\n↳ abrupt-4xCO2/\n  ↳ day/            Daily mean files.\n    ↳ \u003cvariable\u003e/   Daily mean NorESM2 NetCDF files in the abrupt-4xCO2 experiment, where variable is FLNT, FLNTC, FLUT, FLUTC, FSNTOA, FSNTOAC and SOLIN.\n    ↳ 2.5deg/       The same as above, but resampled to 2.5°.\n      ↳ \u003cvariable\u003e/\ntas/                Near-surface air temperature. 
This should be extracted from tas.tar.xz.\n↳ historical/\n  ↳ CERES.nc        Near-surface air temperature from observations (GISTEMP).\n  ↳ \u003cmodel\u003e.nc      Near-surface air temperature of a model in the historical experiment.\n↳ abrupt-4xCO2/\n  ↳ CERES.nc        Near-surface air temperature from observations (GISTEMP).\n  ↳ \u003cmodel\u003e.nc      Near-surface air temperature of a model in the abrupt-4xCO2 experiment.\nmodels_*            Files containing a list of models to be processed. Available in this repository.\ntas.tar.xz          Near-surface air temperature (compressed archive). Available in this repository.\n```\n\n### Data directory\n\nOutput from the processing commands is written in the data directory\n(`data_4`, `data_10` and `data_27` for 4, 10 and 27 cloud types). In addition,\na common data directory (`data`) stores data common to all cloud type sets.\nBelow is a description of its structure (this is created automatically by the\n`run` script during the data processing):\n\n```\nann\n↳ ceres.h5           ANN model generated by ann train (HDF5).\n↳ history.nc         ANN model training history file (NetCDF).\ncto_ecs\n↳ cto_ecs.nc         Cloud type occurrence vs. 
ECS calculated by calc_cto_ecs (NetCDF).\ndtau_pct\n↳ dtau_pct.nc        Histogram calculated by calc_dtau_pct (NetCDF).\ngeo_cto              Geographical distribution files for models and CERES calculated by calc_geo_cto.\n↳ abrupt-4xCO2       CMIP5 and CMIP6 abrupt-4xCO2 experiment.\n  ↳ all              All models.\n  ↳ part_1           Models for the first figure.\n  ↳ part_2           Models for the continued figure.\n↳ historical         CMIP6 historical experiment.\n  ↳ all              All models.\n  ↳ part_1           Models for the first figure.\n  ↳ part_2           Models for the continued figure.\nidd_geo              Geographical distribution files for IDD calculated by calc_idd_cto.\nidd_sample           Sample IDD files for plotting stations.\nsamples              A symbolic link to data/samples.\n↳ ceres\n  ↳ \u003cyear\u003e           CERES samples generated by prepare_samples.\n  ↳ \u003cyear\u003e.nc        Merged samples for a given year (NetCDF).\n  ↳ training         Symbolic links to the training years in the parent directory.\n  ↳ validation       Symbolic links to the validation years in the parent directory.\n↳ abrupt-4xCO2       abrupt-4xCO2 CMIP5 and CMIP6 experiment.\n  ↳ \u003cmodel\u003e\n    ↳ \u003cyear\u003e         Samples generated by prepare_samples for a model/year in the abrupt-4xCO2 experiment.\n    ↳ \u003cyear\u003e.nc      Merged samples for a given year (NetCDF).\n↳ historical         historical CMIP6 experiment.\n  ↳ \u003cmodel\u003e\n    ↳ \u003cyear\u003e         Samples generated by prepare_samples for a model/year in the historical experiment.\n    ↳ \u003cyear\u003e.nc      Merged samples for a given year (NetCDF).\nsamples_pred\n↳ abrupt-4xCO2\n  ↳ \u003cmodel\u003e\n    ↳ \u003cyear\u003e.nc      Samples predicted with ann apply for a model/year in the abrupt-4xCO2 experiment.\n↳ historical\n  ↳ ceres/\u003cyear\u003e.nc  CERES samples predicted with ann apply for a year.\n  ↳ \u003cmodel\u003e\n    ↳ 
\u003cyear\u003e.nc      Samples predicted with ann apply for a model/year in the historical experiment.\nroc                  Receiver operating characteristic.\n↳ all.nc\n↳ regions.nc\nxval\n↳ \u003cregion\u003e           Results for an ANN trained on station data excluding a region.\n↳ geo_cto            Geographical distribution\n  ↳ all              Input files for the geo_cto_xval plots.\n    ↳ 0_xval_all.nc  Symbolic link to geo_cto/historical/validation/CERES.nc.\n    ↳ 1_xval_NW.nc   Symbolic link to xval/nw/geo_cto/historical/all/CERES.nc.\n    ↳ 2_xval_NE.nc   Symbolic link to xval/ne/geo_cto/historical/all/CERES.nc.\n    ↳ 3_xval_SE.nc   Symbolic link to xval/se/geo_cto/historical/all/CERES.nc.\n    ↳ 4_xval_SW.nc   Symbolic link to xval/sw/geo_cto/historical/all/CERES.nc.\n  ↳ regions.nc       Merged regions (NA, EA, OC and SA) geographical distribution produced by merge_xval_geo_cto.\n```\n\n## How to run\n\nThe `input` directory should be populated with the required input files before\nrunning the scripts.\n\nThe `run` bash script runs the Python scripts in `bin` for various tasks. The\ntasks can be run in a sequence as below. Before running the `run` script,\nconfiguration should be imported from one of `config_4`, `config_10` or\n`config_27` for 4, 10 and 27 cloud types, respectively. The output directories\nfor data files (NetCDF) and plots (PDF and PNG) are `data_x` and `plot_x`,\nwhere x is 4, 10, or 27, respectively. Data files common to all cloud type sets\nare stored in `data`. Some of the tasks might take a significant amount of\ntime to complete (hours to days, depending on the CPU).  In general, the tasks\nshould be run in order because of data dependencies.\n\nPlots which contain complex vector graphics are saved as PNG with width of\nslightly above 1920px. Other plots are saved as PDF.\n\n```sh\n# Optional configuration:\nexport JOBS=24 # Number of concurrent jobs. Defaults to the number of CPU cores if not set.\n\n. 
config_4 # Configuration for 4 cloud types\n# . config_10 for 10 cloud types.\n# . config_27 for 27 cloud types.\n# ./run prepare_* commands only have to be run once for either of config_4, config_10 and config_27 because they are shared between the configurations.\n\n./run prepare_ceres              # Prepare CERES samples.\n./run train_ann                  # Train the ANN.\n./run plot_training_history      # Plot training history [Figure S1].\n./run plot_idd_stations          # Plot IDD stations [Figure 1a].\n./run predict_ceres              # Predict CERES samples using the ANN.\n./run calc_dtau_pct              # Calculate cloud optical depth - cloud top pressure histograms.\n./run plot_dtau_pct              # Plot cloud optical depth - cloud top pressure histograms [Figure 8].\n./run prepare_historical         # Prepare CMIP6 historical samples.\n./run predict_historical         # Predict CMIP6 historical samples using the ANN.\n./run calc_geo_cto_historical    # Calculate geographical distribution of cloud type occurrence from the CMIP6 historical samples.\n./run calc_idd_geo               # Calculate geographical distribution of IDD cloud type occurrence.\n./run plot_idd_n_obs             # Plot number of observations per grid cell in the IDD dataset [Figure S2].\n./run plot_station_corr          # Plot CERES/ANN-IDD station spatial and temporal error correlation [Figure S3].\n./run plot_geo_cto_historical    # Plot geographical distribution of cloud type occurrence for the CMIP6 historical experiment [Figure 6, 7].\n./run plot_cto_historical        # Plot cloud type occurrence bar chart for the CMIP6 historical experiment [Figure 9a].\n./run plot_cto_rmse_ecs          # Plot cloud type occurrence RMSE vs. 
ECS [Figure 12, S10, S11].\n./run prepare_abrupt-4xCO2       # Prepare CMIP5 and CMIP6 abrupt-4xCO2 samples.\n./run predict_abrupt-4xCO2       # Predict CMIP5 and CMIP6 abrupt-4xCO2 samples using the ANN.\n./run calc_geo_cto_abrupt-4xCO2  # Calculate geographical distribution of cloud type occurrence from the CMIP5 and CMIP6 abrupt-4xCO2 samples.\n./run plot_geo_cto_abrupt-4xCO2  # Plot geographical distribution of cloud type occurrence for the CMIP5 and CMIP6 abrupt-4xCO2 experiment [Figure S7, S8].\n./run plot_cto_abrupt-4xCO2      # Plot cloud type occurrence bar chart for the CMIP5 and CMIP6 abrupt-4xCO2 experiment [Figure 9b].\n./run calc_cto_ecs               # Calculate cloud type occurrence vs. ECS regression in the CMIP5 and CMIP6 abrupt-4xCO2 experiment.\n./run plot_cto_ecs               # Plot cloud type occurrence vs. ECS regression in the CMIP5 and CMIP6 abrupt-4xCO2 experiment [Figure 11].\n./run train_ann_xval             # Train ANNs for cross-validation.\n./run predict_ceres_xval         # Predict CERES cross-validation samples using the ANN.\n./run calc_geo_cto_xval          # Calculate geographical distribution of cloud type occurrence for cross-validation.\n./run plot_geo_cto_xval          # Plot geographical distribution of cloud type occurrence for cross-validation [Figure 3, S12].\n./run plot_validation            # Plot validation results [Figure 4].\n./run calc_roc                   # Calculate ROC.\n./run plot_roc                   # Plot ROC [Figure 5].\n```\n\n### Environment variables\n\nThe run script supports the following environment variables:\n\n- `JOBS`: Number of concurrent jobs. Default: Number of CPU cores.\n- `INPUT`: Input directory. Default: `input`.\n- `DATA`: Data (output) directory. Default: `data`.\n- `DATA_COMMON`: Common data (output) directory. Default: `data`.\n- `PLOT`: Plot (output) directory. Default: `plot`.\n- `NCLASSES`: Number of cloud classes (4, 10 or 27). 
Default: 4.\n- `EXCLUDE_NIGHT`: Exclude samples containing nighttime. The same as the\n  equivalent [ann](#ann) option. Default: true.\n\n## Commands\n\n### Overview\n\nBelow is an overview of the available commands showing their dependencies and\nthe paper figures they produce.\n\n```\nprepare_samples\n↳ plot_idd_stations [Figure 1a]\n↳ ann\n  ↳ plot_sample [Figure 1b, c]\n  ↳ plot_training_history [Figure S1]\n  ↳ calc_dtau_pct\n    ↳ plot_dtau_pct [Figure 8]\n  ↳ calc_geo_cto\n    ↳ calc_idd_geo\n      ↳ plot_geo_cto [Figure 3, 6, 7, S7, S8, S12]\n      ↳ plot_cto_rmse_ecs [Figure 12, S9–11]\n      ↳ plot_cto [Figure 9, S4–6]\n      ↳ calc_cto_ecs\n        ↳ plot_cto_ecs [Figure 11]\n      ↳ calc_cloud_props\n        ↳ plot_cloud_props [Figure 10]\n      ↳ plot_station_corr [Figure S3]\n        ↳ plot_idd_n_obs [Figure S2]\n        ↳ merge_xval_geo_cto\n          ↳ plot_validation [Figure 4]\n          ↳ calc_roc\n            ↳ plot_roc [Figure 5]\n```\n\n### Main commands\n\nBelow is a description of the main commands. They can be run either\nindividually or with the `run` command as described above. They should be run\nin a Linux terminal (bash). The commands are located in the `bin` directory and\nshould be run from the main repository directory with `bin/\u003ccommand\u003e\n[\u003carguments\u003e...]`.\n\nSome of the commands use [PST](https://github.com/peterkuma/pst/) for command\nline argument parsing, which allows passing of complex arguments such as\narrays, but may also require escaping special characters, for example in file\nnames.\n\n#### ann\n\n\n```\nTrain or apply the artificial neural network (ANN).\n\nUsage: ann train INPUT INPUT_VAL OUTPUT OUTPUT_HISTORY [OPTIONS]\n       ann apply MODEL INPUT OUTPUT [OPTIONS]\n\nThis program uses PST for command line argument parsing.\n\nArguments (ann train):\n\n  INPUT           Input directory with samples. 
The output of prepare_samples (NetCDF).\n  INPUT_VAL       Input directory with validation samples (NetCDF).\n  OUTPUT          Output model (HDF5).\n  OUTPUT_HISTORY  History output (NetCDF).\n\nArguments (ann apply):\n\n  MODEL   TensorFlow model (HDF5).\n  INPUT   Input directory with samples. The output of prepare_samples (NetCDF).\n  OUTPUT  Output samples directory (NetCDF).\n\nOptions (ann train):\n\n  night: VALUE          Train for nighttime only. One of: true or false. Default: false.\n  exclude: { LAT1 LAT2 LON1 LON2 }\n      Exclude samples with pixels in a region bounded by given latitude and longitude. Default: none.\n  nsamples: VALUE       Maximum number of samples to use for the training per day. Default: 20.\n\nOptions:\n\n  exclude_night: VALUE  Exclude nighttime samples. One of: true or false. Default: true.\n  nclasses: VALUE  Number of cloud types. One of: 4, 10, 27. Default: 4.\n\nExamples:\n\nbin/ann train data/samples/ceres_training/training data/samples/ceres_training/validation data/ann/ceres.h5 data/ann/history.nc\nbin/ann apply data/ann/ceres.h5 data/samples/ceres data/samples_pred/ceres\nbin/ann apply data/ann/ceres.h5 data/samples/historical/AWI-ESM-1-1-LR data/samples_pred/historical/AWI-ESM-1-1-LR\n```\n\n\n#### calc\\_cloud\\_props\n\n\n```\nCalculate statistics of cloud properties by cloud type.\n\nUsage: calc_cloud_props TYPE CTO INPUT OUTPUT\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  TYPE    Type of input data. One of: \"ceres\" (CERES), \"cmip\" (CMIP), \"era5\" (ERA5), \"noresm2\" (NorESM2), \"merra2\" (MERRA-2).\n  CTO     Cloud type occurrence. 
The output of calc_geo_cto (NetCDF).\n  INPUT   CMIP cloud property (clt, cod or pctisccp) directory (NetCDF) or CERES SYN1deg (NetCDF).\n  OUTPUT  Output file (NetCDF).\n\nExamples:\n\nbin/calc_cloud_props cmip data/geo_cto/historical/all/UKESM1-0-LL.nc input/cmip6/historical/day/by-model/UKESM1-0-LL/ data/cloud_props/UKESM1-0-LL.nc\n```\n\n\n#### calc\\_cto\\_ecs\n\n\n```\nCalculate cloud type occurrence vs. ECS regression.\n\nUsage: calc_cto_ecs INPUT ECS OUTPUT\n\nArguments:\n\n  INPUT   Input directory. The output of calc_geo_cto (NetCDF).\n  ECS     ECS, TCR and CLD input (CSV).\n  OUTPUT  Output file (NetCDF).\n\nExamples:\n\nbin/calc_cto_ecs data/geo_cto/abrupt-4xCO2/ input/ecs/ecs.csv data/cto_ecs/cto_ecs.nc\n```\n\n\n#### calc\\_dtau\\_pct\n\n\n```\nCalculate cloud optical depth - cloud top press histogram.\n\nUsage: calc_dtau_pct SAMPLES CERES OUTPUT\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  SAMPLES  Directory with samples. The output of prepare_samples (NetCDF).\n  CERES    Directory with CERES SYN1deg (NetCDF).\n  OUTPUT   Output file (NetCDF).\n\nExamples:\n\nbin/calc_dtau_pct data/samples_pred/ceres input/ceres data/dtau_pct/dtau_pct.nc\n```\n\n\n#### calc\\_geo\\_cto\n\n\n```\nCalculate geographical distribution of cloud type occurrence distribution.\n\nUsage: calc_geo_cto INPUT [INPUT_NIGHT] TAS OUTPUT [OPTIONS]\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  INPUT        Input file or directory (NetCDF). The output of tf.\n  INPUT_NIGHT  Input directory daily files - nightime samples (NetCDF). The output of tf.\n  TAS          Input file with tas. The output of gistemp_to_nc (NetCDF).\n  OUTPUT       Output file (NetCDF).\n\nOptions:\n\n  resolution: VALUE  Resolution (degrees). Default: 5. 
180 must be divisible by \u003cvalue\u003e.\n\nExamples:\n\nbin/calc_geo_cto data/samples_pred/ceres input/tas/historical/CERES.nc data/geo_cto/historical/all/CERES.nc\nbin/calc_geo_cto data/samples_pred/historical/AWI-ESM-1-1-LR input/tas/historical/AWI-ESM-1-1-LR data/geo_cto/historical/all/AWI-ESM-1-1-LR.nc\n```\n\n\n#### calc\\_idd\\_geo\n\n\n```\nCalculate geographical distribution of cloud types from IDD data.\n\nUsage: calc_idd_geo SYNOP BUOY FROM TO OUTPUT\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  SYNOP   Input synop directory (NetCDF).\n  BUOY    Input buoy directory (NetCDF).\n  FROM    From date (ISO).\n  TO      To date (ISO).\n  OUTPUT  Output file (NetCDF).\n\nOptions:\n\n  nclasses: VALUE    Number of cloud types. One of: 4, 10 or 27. Default: 4.\n  resolution: VALUE  Resolution (degrees). Default: 5. 180 must be divisible by VALUE.\n\nExamples:\n\nbin/calc_idd_geo input/idd/{synop,buoy} 2007-01-01 2007-12-31 data/idd_geo/2007.nc nclasses: 10\n```\n\n\n#### calc\\_roc\n\n\n```\nCalculate receiver operating characteristic.\n\nUsage: calc_roc INPUT IDD OUTPUT [OPTIONS]\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  INPUT   Validation CERES/ANN dataset. The output of calc_geo_cto for validation years (NetCDF).\n  IDD     Validation IDD dataset. The output of calc_idd_geo for validation years (NetCDF).\n  OUTPUT  Output file (NetCDF).\n\nOptions:\n\n  area: { LAT1 LAT2 LON1 LON2 }  Area to validate on.\n\nExamples:\n\nbin/calc_roc data/xval/na/geo_cto/historical/all/CERES.nc data/idd_geo/IDD.nc data/roc/NE.nc area: { 0 90 -180 0 }\n```\n\n\n#### merge\\_xval\\_geo\\_cto\n\n\n```\n\nMerge cross validation geographical distribution of cloud type occurrence.\n\nUsage: merge_xval_geo_cto [INPUT...] [AREA...] 
OUTPUT\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  INPUT   The output of calc_geo_cto (NetCDF).\n  AREA    Area of input to merge in the format { LAT1 LAT2 LON1 LON2 }. The number of area arguments must be the same as the number of input arguments.\n  OUTPUT  Output file (NetCDF).\n\nExamples:\n\nbin/merge_xval_geo_cto data/xval/{na,ea,oc,sa}/geo_cto/historical/all/CERES.nc { 15 45 -60 -30 } { 30 60 90 120 } { -45 -15 150 180 } { -30 0 -75 -45 } data/xval/geo_cto/regions.nc\n```\n\n\n#### plot\\_cloud\\_props [Figure 10]\n\n\n```\nUsage: plot_cloud_props VAR INPUT ECS OUTPUT [OPTIONS]\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  VAR     Variable. One of: \"clt\", \"cod\", \"pct\".\n  INPUT   Input directory. The output of calc_cloud_props (NetCDF).\n  ECS     ECS file (CSV).\n  OUTPUT  Output plot (PDF).\n\nOptions:\n\n  legend: VALUE  Plot legend (\"true\" or \"false\"). Default: \"true\".\n\nExamples:\n\nbin/plot_cloud_props clt data/cloud_props/ input/ecs/ecs.csv plot/cloud_props_clt.pdf\nbin/plot_cloud_props cod data/cloud_props/ input/ecs/ecs.csv plot/cloud_props_cod.pdf\nbin/plot_cloud_props pct data/cloud_props/ input/ecs/ecs.csv plot/cloud_props_pct.pdf\n```\n\n\n#### plot\\_cto [Figure 9, S4–6]\n\n\n```\nPlot global mean cloud type occurrence.\n\nUsage: plot_cto VARNAME DEGREE ABSREL REGRESSION INPUT ECS OUTPUT TITLE [OPTIONS]\n\nArguments:\n\n  VARNAME     Variable name. One of: \"ecs\" (ECS), \"tcr\" (TCR), \"cld\" (cloud feedback).\n  DEGREE      One of: \"0\" (mean), \"1-time\" (trend in time), \"1-tas\" (trend in tas).\n  ABSREL      One of: \"absolute\" (absolute value), \"relative\" (relative to CERES).\n  REGRESSION  Plot regression. One of: \"true\" or \"false\".\n  INPUT       Input directory. 
The output of calc_geo_cto (NetCDF).\n  ECS         ECS file (CSV).\n  OUTPUT      Output plot (PDF).\n  TITLE       Plot title.\n\nOptions:\n\n  legend: VALUE  Show legend (\"true\" or \"false\"). Default: \"true\".\n\nExamples:\n\nbin/plot_cto ecs 0 relative false data/geo_cto/historical/ input/ecs/ecs.csv plot/cto_historical.pdf 'CMIP6 historical (2003-2014) and reanalyses (2003-2020) relative to CERES (2003-2020)'\nbin/plot_cto ecs 1-tas absolute false data/geo_cto/abrupt-4xCO2/ input/ecs/ecs.csv plot/cto_abrupt-4xCO2.pdf 'CMIP abrupt-4xCO2 (first 100 years)'\n```\n\n\n#### plot\\_cto\\_ecs [Figure 11]\n\n\n```\nPlot cloud type occurrence vs. ECS regression.\n\nUsage: plot_cto_ecs VARNAME INPUT SUMMARY OUTPUT\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  VARNAME  Variable name. One of: \"ecs\" (ECS), \"tcr\" (TCR), \"cld\" (cloud feedback).\n  INPUT    Input file. The output of calc_cto_ecs (NetCDF).\n  OUTPUT   Output plot (PDF).\n\nExamples:\n\nbin/plot_cto_ecs ecs data/cto_ecs/cto_ecs.nc plot/cto_ecs.pdf\n```\n\n\n#### plot\\_cto\\_rmse\\_ecs [Figure 12, S9–11]\n\n\n```\nPlot scatter plot of RMSE of the geographical distribution of cloud type occurrence and sensitivity indicators (ECS, TCR and cloud feedback).\n\nUsage: plot_cto_rmse_ecs INPUT ECS OUTPUT [OPTIONS]\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  INPUT   Input directory. The output of calc_geo_cto or calc_cto (NetCDF).\n  ECS     ECS file (CSV).\n  OUTPUT  Output plot (PDF).\n\nOptions:\n\n  legend: VALUE  Plot legend (\"true\" or \"false\"). Default: \"true\".\n\nExamples:\n\nbin/plot_cto_rmse_ecs data/geo_cto/historical/all input/ecs/ecs.csv plot/geo_cto_rmse_ecs.pdf\n```\n\n\n#### plot\\_dtau\\_pct [Figure 8]\n\n\n```\nPlot cloud optical depth - cloud top pressure histogram.\n\nUsage: plot_dtau_pct INPUT OUTPUT\n\nArguments:\n\n  INPUT   Input file. 
The output of calc_dtau_pct (NetCDF).\n  OUTPUT  Output plot (PDF).\n\nExamples:\n\nbin/plot_dtau_pct data/dtau_pct/dtau_pct.nc plot/dtau_pct.png\n```\n\n\n#### plot\\_geo\\_cto [Figure 3, 6, 7, S7, S8, S12]\n\n\n```\nPlot geographical distribution of cloud type occurrence.\n\nUsage: plot_geo_cto INPUT ECS OUTPUT [OPTIONS]\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  INPUT   Input directory. The output of calc_geo_cto (NetCDF).\n  ECS     ECS file (CSV).\n  OUTPUT  Output plot (PDF).\n\nOptions:\n\n  degree: VALUE       Degree. One of: 0 (absolute value) or 1 (trend). Default: 0.\n  relative: VALUE     Plot relative to CERES. One of: true or false. Default: true.\n  normalized: VALUE   Plot normalized CERES. One of: true, false, only. Default: false.\n  with_ref: VALUE     Plot reference row. One of: true, false. Default: true.\n  label_start: VALUE  Start plot labels with letter VALUE. Default: a.\n\nExamples:\n\nbin/plot_geo_cto data/geo_cto/historical/part_1 input/ecs/ecs.csv plot/geo_cto_historical_1.png\nbin/plot_geo_cto data/geo_cto/historical/part_2 input/ecs/ecs.csv plot/geo_cto_historical_2.png\n```\n\n\n#### plot\\_idd\\_n\\_obs [Figure S2]\n\n\n```\nPlot a map showing the number of observations in IDD.\n\nUsage: plot_idd_n_obs INPUT OUTPUT\n\nArguments:\n\n  INPUT   Input dataset. The output of calc_idd_geo (NetCDF).\n  OUTPUT  Output plot (PDF).\n\nExamples:\n\nbin/plot_idd_n_obs data/idd_geo/validation.nc plot/idd_n_obs.png\n```\n\n\n#### plot\\_idd\\_stations [Figure 1a]\n\n\n```\nPlot IDD stations on a map.\n\nUsage: plot_idd_stations INPUT SAMPLE N OUTPUT TITLE\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  INPUT   IDD input directory (NetCDF).\n  SAMPLE  CERES sample. 
The output of tf apply (NetCDF).\n  N       Sample number.\n  OUTPUT  Output plot (PDF).\n  TITLE   Plot title.\n\nExamples:\n\nbin/plot_idd_stations data/idd_sample/ data/samples/ceres/2010/2010-01-01T00\\:00\\:00.nc 0 plot/idd_stations.png '2010-01-01'\n```\n\n\n#### plot\\_roc [Figure 5]\n\n\n```\nPlot ROC validation curves.\n\nUsage: plot_roc INPUT OUTPUT TITLE\n\nArguments:\n\n  INPUT   Input data. The output of calc_val_stats (NetCDF).\n  OUTPUT  Output plot (PDF).\n  TITLE   Plot title.\n\nExamples:\n\nbin/plot_roc data/roc/all.nc plot/roc_all.pdf all\nbin/plot_roc data/roc/regions.nc plot/roc_regions.pdf regions\n```\n\n\n#### plot\\_sample [Figure 1b, c]\n\n\n```\nPlot sample.\n\nUsage: plot_sample INPUT N OUTPUT\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  INPUT   Input sample (NetCDF). The output of tf.\n  N       Sample number.\n  OUTPUT  Output plot (PDF).\n\nExamples:\n\nbin/plot_sample data/samples/ceres_training/2010/2010-01-01T00\\:00\\:00.nc 0 plot/sample.png\n```\n\n\n#### plot\\_station\\_corr [Figure S3]\n\n\n```\nPlot spatial and temporal correlation of stations.\n\nUsage: plot_station_corr TYPE INPUT1 INPUT2 OUTPUT\n\nArguments:\n\n  TYPE    One of: \"time\" (time correlation), \"space\" (space correlation).\n  INPUT1  Input file. The output of calc_idd_geo (NetCDF).\n  INPUT2  Input file. The output of calc_geo_cto (NetCDF).\n  OUTPUT  Output plot (PDF).\n\nExamples:\n\nbin/plot_station_corr space data/idd_geo/2007.nc data/geo_cto/historical/all/CERES.nc plot/station_corr_space.pdf\nbin/plot_station_corr time data/idd_geo/2007.nc data/geo_cto/historical/all/CERES.nc plot/station_corr_time.pdf\n```\n\n\n#### plot\\_training\\_history [Figure S1]\n\n\n```\nPlot training history loss function.\n\nUsage: plot_training_history INPUT OUTPUT\n\nArguments:\n\n  INPUT   Input history file. 
The output of tf (NetCDF).\n  OUTPUT  Output plot (PDF).\n\nExamples:\n\nbin/plot_training_history data/ann/history.nc plot/training_history.pdf\n```\n\n\n#### plot\\_validation [Figure 4]\n\n\n```\nCalculate cross-validation statistics.\n\nUsage: plot_validation IDD_VAL IDD_TRAIN INPUT... OUTPUT [OPTIONS]\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  IDD_VAL    Validation IDD dataset. The output of calc_idd_geo for validation years (NetCDF).\n  IDD_TRAIN  Training IDD dataset. The output of calc_idd_geo for training years (NetCDF).\n  INPUT      CERES dataset. The output of calc_geo_cto or merge_xval_geo_cto (NetCDF).\n  OUTPUT     Output plot (PDF).\n\nOptions:\n\n  --normalized  Plot normalized plots.\n\nExamples:\n\nbin/plot_validation data/idd_geo/{validation,training}.nc data/geo_cto/historical/all/CERES.nc data/xval/geo_cto/CERES_sectors.nc plot/validation.png\n```\n\n\n#### prepare\\_samples\n\n\n```\nPrepare samples of clouds for CNN training.\n\nUsage: prepare_samples TYPE INPUT SYNOP BUOY START END OUTPUT [OPTIONS]\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  TYPE    Input type. One of: \"ceres\" (CERES SYN 1deg), \"cmip\" (CMIP5/6), \"cloud_cci\" (Cloud_cci), \"era5\" (ERA5), \"merra2\" (MERRA-2), \"noresm2\" (NorESM).\n  INPUT   Input directory with input files (NetCDF).\n  SYNOP   Input directory with IDD synoptic files or \"none\" (NetCDF).\n  BUOY    Input directory with IDD buoy files or \"none\" (NetCDF).\n  START   Start time (ISO).\n  END     End time (ISO).\n  OUTPUT  Output directory.\n\nOptions:\n\n  seed: VALUE           Random seed.\n  keep_stations: VALUE  Keep station records in samples (\"true\" or \"false\"). Default: \"false\".\n  nsamples: VALUE       Number of samples per day to generate. 
Default: 100.\n\nExamples:\n\nprepare_samples ceres input/ceres input/idd/synop input/idd/buoy 2009-01-01 2009-12-31 data/samples/ceres/2009\nprepare_samples cmip input/cmip6/historical/day/by-model/AWI-ESM-1-1-LR none none 2003-01-01 2003-12-31 data/samples/historical/AWI-ESM-1-1-LR/2003\n```\n\n\n### Auxiliary commands\n\n#### build\\_readme\n\n\n```\nBuild the README document from a template.\n\nUsage: build_readme INPUT BINDIR OUTPUT\n\nArguments:\n\n  INPUT   Input file.\n  BINDIR  Directory with scripts.\n  OUTPUT  Output file.\n\nExamples:\n\nbin/build_readme README.md.in bin README.md\n```\n\n\n#### create\\_by\\_model\n\n```\nCreate a by-model index of CMIP data. This command should be run in the directory with CMIP data.\n\nUsage: create_by_model\n\nExamples:\n\ncd data/cmip5/historical/day\n./create_by_model\n```\n\n#### create\\_by\\_var\n\n```\nCreate a by-var index of CMIP data. This command should be run in the directory with CMIP data.\n\nUsage: create_by_var\n\nExamples:\n\ncd data/cmip5/historical/day\n./create_by_var\n```\n\n#### download\\_cmip\n\n\n```\nDownload CMIP data based on a JSON catalogue downloaded from the CMIP archive search page.\n\nUsage: download_cmip FILENAME VAR START END\n\nThis program uses PST for command line argument parsing.\n\nArguments:\n\n  FILENAME  Input file (JSON).\n  VAR       Variable name.\n  START     Start time (ISO).\n  END       End time (ISO).\n\nExamples:\n\nbin/download_cmip catalog.json tas 1850-01-01 2014-01-01 \u003e files\n```\n\n\n#### gistemp\\_to\\_nc\n\n\n```\nConvert GISTEMP yearly temperature data to NetCDF.\n\nUsage: gistemp_to_nc INPUT OUTPUT\n\nArguments:\n\n  INPUT   Input file \"totalCI_ERA.csv\" (CSV).\n  OUTPUT  Output file (NetCDF).\n\nExamples:\n\nbin/gistemp_to_nc data/gistemp/totalCI_ERA.csv data/gistemp/gistemp.nc\n```\n\n\n## Code style\n\nThe code style is indentation with tabs, tab size equivalent to 4 spaces, and\nUnix line endings (LF). 
The style is applied automatically in editors which\nsupport the `.editorconfig` standard.\n\n## Release notes\n\n### 2.0.0 (2022-12-05)\n\n- A release corresponding to the finalized manuscript at the end of the peer\nreview process.\n\n### 1.0.0 (2022-03-07)\n\n- The first release corresponding to the submitted manuscript version.\n\n## License\n\nThe code in this repository is open source, and can be used and distributed\nfreely under the terms of the MIT license. Please see [LICENSE.md](LICENSE.md)\nfor details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeterkuma%2Fml-clouds-2021","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpeterkuma%2Fml-clouds-2021","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeterkuma%2Fml-clouds-2021/lists"}