{"id":13568764,"url":"https://github.com/snap-stanford/MLAgentBench","last_synced_at":"2025-04-04T05:30:29.432Z","repository":{"id":191871082,"uuid":"684903423","full_name":"snap-stanford/MLAgentBench","owner":"snap-stanford","description":null,"archived":false,"fork":false,"pushed_at":"2024-06-19T13:53:23.000Z","size":4856,"stargazers_count":278,"open_issues_count":5,"forks_count":45,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-05T13:47:02.153Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/snap-stanford.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-30T04:54:00.000Z","updated_at":"2025-03-04T08:38:48.000Z","dependencies_parsed_at":"2023-09-01T08:27:43.091Z","dependency_job_id":"83d72954-fb31-489e-a2b5-db5ee1e1bace","html_url":"https://github.com/snap-stanford/MLAgentBench","commit_stats":null,"previous_names":["snap-stanford/mlagentbench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/snap-stanford%2FMLAgentBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/snap-stanford%2FMLAgentBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/snap-stanford%2FMLAgentBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/snap-stanford%2FMLAgentBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/snap-stanford
","download_url":"https://codeload.github.com/snap-stanford/MLAgentBench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247128696,"owners_count":20888232,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T14:00:31.482Z","updated_at":"2025-04-04T05:30:28.871Z","avatar_url":"https://github.com/snap-stanford.png","language":"Python","readme":"# MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation\n\nMLAgentBench is a suite of end-to-end Machine Learning (ML) experimentation tasks for benchmarking AI agents, where the agent aims to take a given \ndataset and a machine learning task description and autonomously develop or improve an ML model. Paper: https://arxiv.org/abs/2310.03302\n![](figs/main.png)\n\nOur AI agent in action on MLAgentBench:\n[![Watch the video](https://img.youtube.com/vi/s9NANrjLEZs/maxresdefault.jpg)](https://youtu.be/s9NANrjLEZs)\n\nEach task is an interactive environment that directly resembles what human researchers see,\nwhere an agent can read available files, run multiple experiments on a compute cluster, and analyze results to achieve the specified research goal. 
\nSpecifically, we include 13 diverse ML engineering tasks,\nachievable by trying different machine learning methods, data processing, architectures, training processes, etc.:\n![](figs/table.png)\n\n\n# Setup\n\nThe MLAgentBench package can be installed with\n```\npip install -e .\n```\n\nInstall dependencies with Python 3.10 by running \n```\nbash install.sh\n```\nor use our [docker image](https://hub.docker.com/layers/qhwang123/researchassistant/latest/images/sha256-6b3690a13ba44fd089086e9860a298ed49a179d9a04a5406c0df074569a3aabe?context=repo). Since the agent will modify and execute files, we recommend running experiments within a sandbox such as a Docker container.\nFor Docker, use the following instructions:\n1. Pull the Docker image:\n```\ndocker pull qhwang123/researchassistant:latest\n```\n2. Run the Docker container from the image, mounting the current directory to `/MLAgentBench` inside the container with root user permissions to install other packages:\n- On Windows PowerShell\n```\ndocker run -it --user root -v ${PWD}:/MLAgentBench -w /MLAgentBench qhwang123/researchassistant:latest\n```\n- On Mac or Linux\n```\ndocker run -it --user root -v \"$(pwd)\":/MLAgentBench -w /MLAgentBench qhwang123/researchassistant:latest\n```\n\nEach dataset will be prepared when it is run for the first time. You can also prepare them beforehand with \n```\npython -u -m MLAgentBench.prepare_task \u003ctask_name\u003e $(which python)\n```\nFor Kaggle datasets, you need to set up the Kaggle API and authentication (~/.kaggle/kaggle.json) as described [here](https://www.kaggle.com/docs/api). You may also need to provide manual consent to the rules of specific competitions by following the prompts. For Docker, use the following instructions:\n1. Ensure that you have \".kaggle/kaggle.json\" with your API credentials in the MLAgentBench root folder.\n2. 
Once your container is mounted (instructions above), run\n```\nexport KAGGLE_CONFIG_DIR=/MLAgentBench/.kaggle\npip install kaggle\nsudo apt-get install unzip\n```\n\nFinally, put API keys under the root directory of this repo (or wherever you run scripts from). Currently, we support OpenAI (openai_api_key.txt in the format of organization:APIkey), Claude (claude_api_key.txt), and the CRFM API (crfm_api_key.txt). To use an AutoGPT agent, set up the directory as described [here](https://docs.agpt.co/setup/).\n\nUpdate: We now support Gemini Pro and Hugging Face models! To run Gemini, set PROJECT_ID in LLM.py to your project ID. To run a Hugging Face model, specify the model as huggingface/\u003corg name\u003e/\u003cmodel name\u003e.\n\n# Quick Start\n\nTo run our research agent on the cifar10 task with the OpenAI API using gpt-4 and gpt-3.5-turbo:\n\n```\npython -u -m MLAgentBench.runner --python $(which python) --task cifar10 --device 0 --log-dir first_test  --work-dir workspace --llm-name gpt-4 --edit-script-llm-name gpt-4 --fast-llm-name gpt-3.5-turbo \u003e  first_test/log 2\u003e\u00261\n```\n\nNote: capturing the log is necessary for detecting runtime errors such as OOM errors.\n\nThis will produce logs in the `first_test` directory with the following structure:\n```\nfirst_test/\n    agent_log/\n        main_log # main log showing the agent's research process\n        agent_*.json # saved agent states\n        ...\n    env_log/\n        tool_logs/ \n        traces/ # snapshots of the agent workspace\n        trace.json # interaction trace of the agent\n        overall_time.txt # overall time\n        error.txt # generated only if there is a system error\n```\n\nIf LLM names are not specified in the args, we use the claude-v1 model by default for all LLM calls. 
See example logs with GPT-4 over cifar10 [here](https://drive.google.com/drive/folders/1Ozy_zKYdvwcSq3EFnkaudgUXKJmBwQ5t?usp=drive_link).\n\n# Evaluation\n\nTo run evaluation:\n```\npython -m MLAgentBench.eval --log-folder \u003clog_folder\u003e  --task \u003ctask_name\u003e --output-file \u003coutput_name\u003e\n```\n\nThis will evaluate all runs under \u003clog_folder\u003e and write the results to a JSON file.\n\nTo run the baseline, run the trivial policy of directly running train.py and then submitting, with ``--agent_type Agent`` as in baseline.sh:\n\n```\npython -u -m MLAgentBench.runner --python $(which python) --task cifar10 --device 0 --log-dir first_test  --work-dir workspace --agent_type Agent\n```\n\nFinally, to reproduce the plots from the generated JSON files, run plot.py in MLAgentBench.\n\n# Workflow\n\nTo run the benchmark systematically, we recommend the following workflow:\n\n1. Run parallel experiments over different tasks and different agents using `run_experiments.sh`. This will generate log folders with the structure final_exp_logs/\u003cmodel_name\u003e/\u003crun_timestamp\u003e/...\n2. Run baseline.sh on all tasks to provide baselines.\n3. Run eval.sh with properly specified models and tasks to generate evaluation JSON files, including baselines.\n4. Use plot.py in MLAgentBench to analyze the results. Note that you need to fix some paths and names in the file, as marked with TODO.\n\n# Tasks\n\nEach task is a folder in `MLAgentBench/benchmarks/`, under which the `env/` folder contains files that the research agent will see at the beginning, and the `script/` folder contains additional hidden files such as `prepare.py` for downloading data and `eval.py` for evaluation.\n\n# Agents\n\nWe currently support variants of our research agent along with LangChain and AutoGPT agents. See `run_experiments.sh` for their commands.\n\n# Results\nSuccess Rate, i.e. 
the percentage of runs that achieve more than 10% improvement at the\nlast step over the average performance of the baseline in the starter code:\n![](figs/final_improve_10.png)\n\n\n\nAverage improvement over the baseline in the starter code among the runs that made a valid\nsubmission at the last step:\n![](figs/final_improve.png)\n\nSee all logs here: https://github.com/q-hwang/MLAgentBench_logs\n\n# Interactive Mode (Under construction)\n\nYou can also specify tasks interactively to the research agent by running `research_agent_interactive.sh`, or, ideally, as a VS Code extension.\n\n","funding_links":[],"categories":["Evaluation-Benchmarks","🔬 Autonomous Research \u0026 Self-Improving Agents","A01_文本生成_文本对话","Anthropomorphic-Taxonomy"],"sub_categories":["Evaluation \u0026 Benchmarks","大语言对话模型及数据","Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsnap-stanford%2FMLAgentBench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsnap-stanford%2FMLAgentBench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsnap-stanford%2FMLAgentBench/lists"}