{"id":19139619,"url":"https://github.com/microsoft/llf-bench","last_synced_at":"2025-10-30T13:14:21.030Z","repository":{"id":212006401,"uuid":"678489493","full_name":"microsoft/LLF-Bench","owner":"microsoft","description":"A benchmark for evaluating learning agents based on just language feedback","archived":false,"fork":false,"pushed_at":"2025-03-25T06:01:54.000Z","size":15359,"stargazers_count":71,"open_issues_count":0,"forks_count":14,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-07T07:06:01.635Z","etag":null,"topics":["large-language-models","llm","llm-training","llms","machine-learning","natural-language-processing","reinforcement-learning"],"latest_commit_sha":null,"homepage":"https://microsoft.github.io/LLF-Bench/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-14T17:07:13.000Z","updated_at":"2025-04-04T01:58:07.000Z","dependencies_parsed_at":"2023-12-16T19:50:48.518Z","dependency_job_id":"5e913929-8183-4d11-b352-1943c51f47e1","html_url":"https://github.com/microsoft/LLF-Bench","commit_stats":null,"previous_names":["microsoft/llf-bench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FLLF-Bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FLLF-Bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FLLF-Be
nch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FLLF-Bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/LLF-Bench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247608151,"owners_count":20965952,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["large-language-models","llm","llm-training","llms","machine-learning","natural-language-processing","reinforcement-learning"],"created_at":"2024-11-09T07:14:17.044Z","updated_at":"2025-10-30T13:14:15.993Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LLF-Bench: Benchmark for Interactive Learning from Language Feedback\n\nLLF Bench is a benchmark that provides a diverse collection of interactive learning problems where the agent gets language feedback instead of rewards (as in RL) or action feedback (as in imitation learning). The associated website and paper are:\n\n**Website:** https://microsoft.github.io/LLF-Bench/\n\n**Paper:** https://arxiv.org/abs/2312.06853\n\n## *Table of Contents*\n1. [**Overview**](#overview)\n2. [**Design principles**](#design-principles)\n3. [**Installation**](#installation)\n4. [**Special Instructions for running Alfworld**](#special-instructions-for-running-alfworld)\n5. [**Special Instructions for running Metaworld**](#special-instructions-for-running-metaworld)\n6. [**Examples**](#examples)\n7. 
[**Testing**](#testing)\n8. [**Baseline and skyline results**](#baseline-and-skyline-results)\n9. [**Contributing**](#contributing)\n10. [**Trademarks**](#trademarks)\n\n\n## Overview\n\n\u003cimg src=\"https://microsoft.github.io/LLF-Bench/images/llf-bench.png\" width=\"750\"\u003e\n\n\nEach benchmark environment here follows the Gym API.\n\n    observation_dict, info = env.reset()\n    observation_dict, reward, terminated, truncated, info = env.step(action)\n\n`observation_dict` contains three fields:\n\n- 'observation': a (partial) observation of the environment's state\n- 'instruction': a natural language description of the task, including the objective and information about the action space, etc.\n- 'feedback': natural language feedback that helps the agent learn to solve the task.\n\nWhen a field is missing, its value is represented as None. For example, 'instruction' is typically only given by `reset` whereas 'feedback' is only given by `step`.\n\n`reward` is intended for evaluating an agent's performance. It should **not** be passed to the learning agent.\n\n`terminated` indicates whether a task has been solved (i.e., the goal has been reached) or not.\n`truncated` indicates whether the maximal episode length has been reached.\n`info` returns an additional info dict of the environment.\n\n\n## Design principles\n\nWe design LLF-Bench as a benchmark to test the **learning** ability of interactive agents. We design each environment in LLF-Bench such that 'observation' and 'instruction' in `observation_dict` are sufficient (for a human) to tell when the task is indeed solved. Therefore, a policy that operates based purely on 'observation' and 'instruction' can solve these problems. However, we also design these environments such that 'observation' and 'instruction' are not sufficient for designing or efficiently learning the optimal policies. 
Each environment here is designed to have some ambiguities and latent characteristics in the dynamics, reward, or termination, so that the agent cannot infer the optimal policy just based on 'instruction'.\n\nThese features are designed to test an agent's *learning* ability, especially the ability to learn from language feedback. Language feedback can be viewed as a generalization of reward feedback in reinforcement learning. It can not only provide information about reward/success, but it can also convey expressive feedback such as explanations and suggestions. The language feedback is implemented as the field 'feedback' in `observation_dict`, which is intended to help the agent learn better.\n\n\n## Installation\n\nCreate a conda environment.\n\n    conda create -n LLF-Bench python=3.8 -y\n    conda activate LLF-Bench\n\nInstall the repo.\n\n    pip install -e .\n\nor\n\n    pip install -e .[#option1,#option2,etc.]\n\nSome valid options:\n\n    metaworld: for using metaworld envs\n    alfworld: for using alfworld envs\n\nFor example, to use metaworld, install the repo by `pip install -e .[metaworld]`.\n\n### Special Instructions for running Alfworld\n\nAlfworld requires python3.9, so please use python3.9 when creating the conda environment. Activate the conda environment, clone the LLF-Bench repo, and install it using\n\n`pip install -e .[alfworld]`\n\nThe first time you run the alfworld environment, it will download additional files automatically; you don't need to do this yourself. Alfworld also uses a config.yaml file\nthat changes the environment. We use the config yaml file provided here: [llfbench/envs/alfworld/base_config.yaml](https://github.com/microsoft/LLF-Bench/blob/main/llfbench/envs/alfworld/base_config.yaml). If you get path errors, please ensure the source directory is referencing this file correctly. 
This is done in the code [here](https://github.com/microsoft/LLF-Bench/blob/main/llfbench/envs/alfworld/alfworld.py#L47).\n\n### Special Instructions for running Metaworld\n\nYou should use python3.8 and install the repo with metaworld by running `pip install -e .[metaworld]`.\n\nThe `metaworld` option requires libGL, which can be installed by\n\n    sudo apt-get install ffmpeg libsm6 libxext6\n\nFor the `reco` option, please follow the instructions here to register and get your own user key:\n\nhttps://www.omdbapi.com/apikey.aspx\n\nThen, you can set the environment variable `OMDB_API_KEY` to your key:\n\n```bash\nexport OMDB_API_KEY=your_key\n```\n\n## Examples\n\nThis sample code creates an environment implemented in LLF-Bench, and creates an agent that interacts with it. The agent simply prints each observation to the console and takes console input as actions to be relayed to the environment.\n\n```python\nimport llfbench as gym\n\n# Environments in the benchmark are registered following\n# the naming convention of llf-*\n\nenv = gym.make('llf-gridworld-v0')\n\ndone = False\ncumulative_reward = 0.0\n\n# First observation is acquired by resetting the environment\n\nobservation, info = env.reset()\n\nwhile not done:\n    # Observation is a dict with keys 'observation', 'instruction', 'feedback'.\n    # A missing field is None (e.g. no 'feedback' after reset), so skip those\n    # when printing the observation and asking the user for an action\n\n    fields = [observation[k] for k in ('observation', 'instruction', 'feedback')]\n    action = input('\\n'.join(str(f) for f in fields if f is not None) + '\\nAction: ')\n\n    # Gridworld has a text action space, so TextWrapper is not needed\n    # to parse a valid action from the input string\n\n    observation, reward, terminated, truncated, info = env.step(action)\n\n    # reward is never revealed to the agent; only used for evaluation\n\n    cumulative_reward += reward\n\n    # terminated and truncated follow the same semantics as in Gymnasium\n\n    done = terminated or 
truncated\n\nprint(f'Episode reward: {cumulative_reward}')\n```\n\n\n## Testing\n\nThe `tests` folder in the repo contains a few helpful scripts for testing the functionality of LLF-Bench.\n- *test_agents.py*: Creates a `UserAgent` that prints the 'observation' and 'feedback' produced by an LLF-Bench environment to the console, and reads user input from the console as an 'action'.\n- *test_basic_agents.py*: For the subset of LLF-Bench environments that either support a finite action space or admit a pre-built expert optimal policy, this script creates a `RandomActionAgent` and an `ExpertActionAgent` to test the supported environments.\n- *test_envs.py*: Syntactically tests that environments added to the LLF-Bench environment registry are compatible with the expected semantics of LLF-Bench. This is a useful script to run whenever new environments are added or existing environments are customized in the benchmark.\n\n## Baseline and skyline results\n\n\n***\u003cspan style=\"color:red\"\u003eLast updated: 06.12.2024\u003c/span\u003e***\n\n\n\u003cimg src=\"./all_feedback.jpg\" width=\"750\"\u003e\n\nPerformance of basic agents using different LLMs, where the agents receive **all types of feedback** and append the observation and feedback history to their contexts after each step. These numbers can be viewed as **\"skyline\"** performance, since receiving all feedback types typically provides all the information needed to solve the problem near-optimally.\n\u0026nbsp;\n\u0026nbsp;\n\n\n\n\u003cimg src=\"./partial_feedback.jpg\" width=\"750\"\u003e\n\nPerformance of basic agents using different LLMs, where the agents receive **only reward, hindsight positive, and hindsight negative feedback** and append the observation and feedback history to their contexts after each step. 
These numbers can be viewed as **\"baseline\"** performance.\n\n\nDetails: For GPT-3.5-Turbo and GPT-4, the statistics are computed over 10 episodes for all problem sets except Alfworld, for which, due to high problem instance variability, we used 50 episodes. For other language models, 50 episodes are used for all problem sets. For Metaworld, Alfworld, and Gridworld, the mean return is defined as the policy's success rate, which uniquely determines the standard error. Therefore, for the problems from these three problem sets, the st.e. is shown in gray.\n\n## Dataset Metadata\nThe following table is necessary for this dataset to be indexed by search\nengines such as \u003ca href=\"https://g.co/datasetsearch\"\u003eGoogle Dataset Search\u003c/a\u003e.\n\u003cdiv itemscope itemtype=\"http://schema.org/Dataset\"\u003e\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eproperty\u003c/th\u003e\n    \u003cth\u003evalue\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ename\u003c/td\u003e\n    \u003ctd\u003e\u003ccode itemprop=\"name\"\u003eLLF-Bench: Benchmark for Interactive Learning from Language Feedback\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ealternateName\u003c/td\u003e\n    \u003ctd\u003e\u003ccode itemprop=\"alternateName\"\u003eLLF-Bench\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eurl\u003c/td\u003e\n    \u003ctd\u003e\u003ccode itemprop=\"url\"\u003ehttps://microsoft.github.io/LLF-Bench/\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003edescription\u003c/td\u003e\n    \u003ctd\u003e\u003ccode itemprop=\"description\"\u003eLLF Bench is a benchmark that provides a diverse collection of interactive learning problems where the agent gets language feedback instead of rewards (as in RL) or action feedback (as in imitation learning). 
\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eprovider\u003c/td\u003e\n    \u003ctd\u003e\n      \u003cdiv itemscope itemtype=\"http://schema.org/Organization\" itemprop=\"provider\"\u003e\n        \u003ctable\u003e\n          \u003ctr\u003e\n            \u003cth\u003eproperty\u003c/th\u003e\n            \u003cth\u003evalue\u003c/th\u003e\n          \u003c/tr\u003e\n          \u003ctr\u003e\n            \u003ctd\u003ename\u003c/td\u003e\n            \u003ctd\u003e\u003ccode itemprop=\"name\"\u003eMicrosoft\u003c/code\u003e\u003c/td\u003e\n          \u003c/tr\u003e\n          \u003ctr\u003e\n            \u003ctd\u003esameAs\u003c/td\u003e\n            \u003ctd\u003e\u003ccode itemprop=\"sameAs\"\u003ehttps://microsoft.com//\u003c/code\u003e\u003c/td\u003e\n          \u003c/tr\u003e\n        \u003c/table\u003e\n      \u003c/div\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003elicense\u003c/td\u003e\n    \u003ctd\u003e\n      \u003cdiv itemscope itemtype=\"http://schema.org/CreativeWork\" itemprop=\"license\"\u003e\n        \u003ctable\u003e\n          \u003ctr\u003e\n            \u003cth\u003eproperty\u003c/th\u003e\n            \u003cth\u003evalue\u003c/th\u003e\n          \u003c/tr\u003e\n          \u003ctr\u003e\n            \u003ctd\u003ename\u003c/td\u003e\n            \u003ctd\u003e\u003ccode itemprop=\"name\"\u003eMIT License\u003c/code\u003e\u003c/td\u003e\n          \u003c/tr\u003e\n          \u003ctr\u003e\n            \u003ctd\u003eurl\u003c/td\u003e\n            \u003ctd\u003e\u003ccode itemprop=\"url\"\u003ehttps://github.com/microsoft/LLF-Bench/blob/main/LICENSE/\u003c/code\u003e\u003c/td\u003e\n          \u003c/tr\u003e\n        \u003c/table\u003e\n      \u003c/div\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ecitation\u003c/td\u003e\n    \u003ctd\u003e\u003ccode itemprop=\"citation\"\u003eCheng, C. 
A., Kolobov, A., Misra, D., Nie, A., \u0026 Swaminathan, A. (2023). Llf-bench: Benchmark for interactive learning from language feedback. arXiv preprint arXiv:2312.06853.\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n## Contributing\n\nThis project welcomes contributions and suggestions.  Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. 
Authorized use of Microsoft\ntrademarks or logos is subject to and must follow\n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos are subject to those third-party's policies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fllf-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fllf-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fllf-bench/lists"}