{"id":13701701,"url":"https://github.com/google-research/rliable","last_synced_at":"2025-04-08T04:14:47.228Z","repository":{"id":37604891,"uuid":"398111584","full_name":"google-research/rliable","owner":"google-research","description":"[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.","archived":false,"fork":false,"pushed_at":"2024-05-28T20:32:15.000Z","size":1952,"stargazers_count":708,"open_issues_count":2,"forks_count":42,"subscribers_count":11,"default_branch":"master","last_synced_at":"2024-05-29T11:28:14.052Z","etag":null,"topics":["benchmarking","evaluation-metrics","google","machine-learning","reinforcement-learning","rl"],"latest_commit_sha":null,"homepage":"https://agarwl.github.io/rliable","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.bib","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-08-20T00:41:06.000Z","updated_at":"2024-06-18T16:45:17.295Z","dependencies_parsed_at":"2024-05-28T23:32:49.144Z","dependency_job_id":"c563d6eb-4b98-4197-a3ef-1fcd661a3b76","html_url":"https://github.com/google-research/rliable","commit_stats":{"total_commits":57,"total_committers":9,"mean_commits":6.333333333333333,"dds":0.5614035087719298,"last_synced_commit":"b11d308fd4afb3e20f1a01f42cdcb30f40fc9f93"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Frliable","tags_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Frliable/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Frliable/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Frliable/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google-research","download_url":"https://codeload.github.com/google-research/rliable/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247773719,"owners_count":20993639,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmarking","evaluation-metrics","google","machine-learning","reinforcement-learning","rl"],"created_at":"2024-08-02T20:01:55.098Z","updated_at":"2025-04-08T04:14:47.202Z","avatar_url":"https://github.com/google-research.png","language":"Jupyter Notebook","funding_links":[],"categories":["Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL)","Jupyter Notebook","其他_机器学习与深度学习","Tools"],"sub_categories":["RL/DRL Benchmarking","Performance (\u0026 Automated ML)"],"readme":"\n# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1a0pSD-1tWhMmeJeeoyZM1A-HCW3yf1xR?usp=sharing) [![Website](https://img.shields.io/badge/www-Website-green)](https://agarwl.github.io/rliable) [![Blog](https://img.shields.io/badge/b-Blog-blue)](https://ai.googleblog.com/2021/11/rliable-towards-reliable-evaluation.html)\n\n`rliable` is an open-source Python library for 
reliable evaluation, even with a *handful\nof runs*, on reinforcement learning and machine learning benchmarks.\n\n| **Desideratum** | **Current evaluation approach** |  **Our Recommendation**    |\n| --------------------------------- | ----------- | --------- |\n| Uncertainty in aggregate performance | **Point estimates**: \u003cul\u003e \u003cli\u003e Ignore statistical uncertainty \u003c/li\u003e \u003cli\u003e Hinder *results reproducibility* \u003c/li\u003e\u003c/ul\u003e | Interval estimates using **stratified bootstrap confidence intervals** (CIs) |\n|Performance variability across tasks and runs| **Tables with task mean scores**: \u003cul\u003e\u003cli\u003e Overwhelming beyond a few tasks \u003c/li\u003e \u003cli\u003e Standard deviations frequently omitted \u003c/li\u003e \u003cli\u003e Incomplete picture for multimodal and heavy-tailed distributions \u003c/li\u003e \u003c/ul\u003e | **Score distributions** (*performance profiles*): \u003cul\u003e \u003cli\u003e Show tail distribution of scores on combined runs across tasks \u003c/li\u003e \u003cli\u003e Allow qualitative comparisons \u003c/li\u003e \u003cli\u003e Easily read any score percentile \u003c/li\u003e \u003c/ul\u003e|\n|Aggregate metrics for summarizing benchmark performance | **Mean**:  \u003cul\u003e\u003cli\u003e Often dominated by performance on outlier tasks \u003c/li\u003e\u003c/ul\u003e \u0026nbsp; **Median**: \u003cul\u003e \u003cli\u003e Statistically inefficient (requires a large number of runs to claim improvements) \u003c/li\u003e  \u003cli\u003e Poor indicator of overall performance: zero scores on nearly half the tasks do not change it \u003c/li\u003e \u003c/ul\u003e| **Interquartile Mean (IQM)** across all runs: \u003cul\u003e \u003cli\u003e Performance on the middle 50% of combined runs \u003c/li\u003e \u003cli\u003e Robust to outlier scores but more statistically efficient than the median \u003c/li\u003e \u003c/ul\u003e To show other aspects of performance gains, report *Probability of 
improvement* and *Optimality gap* |\n\n`rliable` provides support for:\n\n * Stratified Bootstrap Confidence Intervals (CIs)\n * Performance Profiles (with plotting functions)\n * Aggregate metrics\n   * Interquartile Mean (IQM) across all runs\n   * Optimality Gap\n   * Probability of Improvement\n\n\u003cdiv align=\"left\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/google-research/rliable/master/images/aggregate_metric.png\"\u003e\n\u003c/div\u003e\n\n## Interactive colab\nWe provide a colab at [bit.ly/statistical_precipice_colab](https://colab.research.google.com/drive/1a0pSD-1tWhMmeJeeoyZM1A-HCW3yf1xR?usp=sharing),\nwhich shows how to use the library with examples of published algorithms on\nwidely used benchmarks including Atari 100k, ALE, DM Control and Procgen.\n\n### Data for individual runs on Atari 100k, ALE, DM Control and Procgen\n\nYou can access the data for individual runs in the public GCP bucket here (you might need to sign in with your\nGmail account to use gcloud): https://console.cloud.google.com/storage/browser/rl-benchmark-data.\nThe interactive colab above also allows you to access the data programmatically.\n\n### Paper\nFor more details, refer to the accompanying **NeurIPS 2021** paper (**Outstanding Paper** Award):\n[Deep Reinforcement Learning at the Edge of the Statistical Precipice](https://arxiv.org/pdf/2108.13264.pdf).\n\n\n### Installation\n\nTo install `rliable`, run:\n```sh\npip install -U rliable\n```\n\nTo install the latest version of `rliable` as a package, run:\n\n```sh\npip install git+https://github.com/google-research/rliable\n```\n\nTo import `rliable`, we suggest:\n\n```python\nfrom rliable import library as rly\nfrom rliable import metrics\nfrom rliable import plot_utils\n```\n\n### Aggregate metrics with 95% Stratified Bootstrap CIs\n\n\n##### IQM, Optimality Gap, Median, Mean\n```python\nimport numpy as np\n\nalgorithms = ['DQN (Nature)', 'DQN (Adam)', 'C51', 'REM', 'Rainbow',\n              'IQN', 'M-IQN', 
'DreamerV2']\n# Load ALE scores as a dictionary mapping algorithms to their human normalized\n# score matrices, each of which is of size `(num_runs x num_games)`.\natari_200m_normalized_score_dict = ...\naggregate_func = lambda x: np.array([\n  metrics.aggregate_median(x),\n  metrics.aggregate_iqm(x),\n  metrics.aggregate_mean(x),\n  metrics.aggregate_optimality_gap(x)])\naggregate_scores, aggregate_score_cis = rly.get_interval_estimates(\n  atari_200m_normalized_score_dict, aggregate_func, reps=50000)\nfig, axes = plot_utils.plot_interval_estimates(\n  aggregate_scores, aggregate_score_cis,\n  metric_names=['Median', 'IQM', 'Mean', 'Optimality Gap'],\n  algorithms=algorithms, xlabel='Human Normalized Score')\n```\n\n\u003cdiv align=\"left\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/google-research/rliable/master/images/ale_interval_estimates.png\"\u003e\n\u003c/div\u003e\n\n##### Probability of Improvement\n```python\n# Load ProcGen scores as a dictionary containing pairs of normalized score\n# matrices for pairs of algorithms we want to compare\nprocgen_algorithm_pairs = {.. 
, 'x,y': (score_x, score_y), ..}\naverage_probabilities, average_prob_cis = rly.get_interval_estimates(\n  procgen_algorithm_pairs, metrics.probability_of_improvement, reps=2000)\nplot_utils.plot_probability_of_improvement(average_probabilities, average_prob_cis)\n```\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/google-research/rliable/master/images/procgen_probability_of_improvement.png\"\u003e\n\u003c/div\u003e\n\n#### Sample Efficiency Curve\n```python\nalgorithms = ['DQN (Nature)', 'DQN (Adam)', 'C51', 'REM', 'Rainbow',\n              'IQN', 'M-IQN', 'DreamerV2']\n# Load ALE scores as a dictionary mapping algorithms to their human normalized\n# score matrices across all 200 million frames, each of which is of size\n# `(num_runs x num_games x 200)` where scores are recorded every million frames.\nale_all_frames_scores_dict = ...\n# Zero-based indices of the frames (in millions) at which to evaluate the IQM.\nframes = np.array([1, 10, 25, 50, 75, 100, 125, 150, 175, 200]) - 1\nale_frames_scores_dict = {algorithm: score[:, :, frames] for algorithm, score\n                          in ale_all_frames_scores_dict.items()}\n# Compute the IQM separately for each selected frame.\niqm = lambda scores: np.array([metrics.aggregate_iqm(scores[..., frame])\n                               for frame in range(scores.shape[-1])])\niqm_scores, iqm_cis = rly.get_interval_estimates(\n  ale_frames_scores_dict, iqm, reps=50000)\nplot_utils.plot_sample_efficiency_curve(\n    frames+1, iqm_scores, iqm_cis, algorithms=algorithms,\n    xlabel=r'Number of Frames (in millions)',\n    ylabel='IQM Human Normalized Score')\n```\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/google-research/rliable/master/images/ale_legend.png\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/google-research/rliable/master/images/atari_sample_efficiency_iqm.png\"\u003e\n\u003c/div\u003e\n\n### Performance Profiles\n\n```python\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport seaborn as sns\n\n# Load ALE scores as a dictionary mapping algorithms to their human normalized\n# score matrices, each of which is of size 
`(num_runs x num_games)`.\natari_200m_normalized_score_dict = ...\n# Human normalized score thresholds\natari_200m_thresholds = np.linspace(0.0, 8.0, 81)\nscore_distributions, score_distributions_cis = rly.create_performance_profile(\n    atari_200m_normalized_score_dict, atari_200m_thresholds)\n# Plot score distributions\nfig, ax = plt.subplots(ncols=1, figsize=(7, 5))\nplot_utils.plot_performance_profiles(\n  score_distributions, atari_200m_thresholds,\n  performance_profile_cis=score_distributions_cis,\n  colors=dict(zip(algorithms, sns.color_palette('colorblind'))),\n  xlabel=r'Human Normalized Score $(\\tau)$',\n  ax=ax)\n```\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/google-research/rliable/master/images/ale_legend.png\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/google-research/rliable/master/images/ale_score_distributions_new.png\"\u003e\n\u003c/div\u003e\n\nThe above profile can also be plotted with non-linear scaling as follows:\n\n```python\n# Reuses `score_distributions`, `score_distributions_cis` and\n# `atari_200m_thresholds` computed in the previous snippet.\nplot_utils.plot_performance_profiles(\n  score_distributions, atari_200m_thresholds,\n  performance_profile_cis=score_distributions_cis,\n  use_non_linear_scaling=True,\n  xticks=[0.0, 0.5, 1.0, 2.0, 4.0, 8.0],\n  colors=dict(zip(algorithms, sns.color_palette('colorblind'))),\n  xlabel=r'Human Normalized Score $(\\tau)$',\n  ax=ax)\n```\n\n\n### Dependencies\nThe code was tested under `Python\u003e=3.7` and uses these packages:\n\n- arch == 5.3.0\n- scipy \u003e= 1.7.0\n- numpy \u003e= 0.9.0\n- absl-py \u003e= 1.16.4\n- seaborn \u003e= 0.11.2\n\nCiting\n------\nIf you find this open source release useful, please cite the accompanying paper:\n\n    @article{agarwal2021deep,\n      title={Deep Reinforcement Learning at the Edge of the Statistical Precipice},\n      author={Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel\n              and Courville, Aaron and Bellemare, Marc G},\n      journal={Advances in Neural Information Processing Systems},\n      
year={2021}\n    }\n\nDisclaimer: This is not an official Google product.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Frliable","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-research%2Frliable","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Frliable/lists"}