{"id":48585030,"url":"https://github.com/zhouzypaul/auto_eval","last_synced_at":"2026-04-08T17:43:19.756Z","repository":{"id":285423989,"uuid":"955625973","full_name":"zhouzypaul/auto_eval","owner":"zhouzypaul","description":"AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World | CoRL 2025","archived":false,"fork":false,"pushed_at":"2026-03-26T17:39:50.000Z","size":1131,"stargazers_count":95,"open_issues_count":1,"forks_count":8,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-03-27T07:25:11.036Z","etag":null,"topics":["foundation-models","robot-learning"],"latest_commit_sha":null,"homepage":"https://auto-eval.github.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zhouzypaul.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-03-26T23:57:46.000Z","updated_at":"2026-03-26T17:39:54.000Z","dependencies_parsed_at":"2025-03-31T17:43:34.574Z","dependency_job_id":"03c48f99-330d-469f-85f2-d360f389178a","html_url":"https://github.com/zhouzypaul/auto_eval","commit_stats":null,"previous_names":["zhouzypaul/auto_eval"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zhouzypaul/auto_eval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhouzypaul%2Fauto_eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhouzypaul%2Fauto_eval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/zhouzypaul%2Fauto_eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhouzypaul%2Fauto_eval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zhouzypaul","download_url":"https://codeload.github.com/zhouzypaul/auto_eval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhouzypaul%2Fauto_eval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31567226,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T14:31:17.711Z","status":"ssl_error","status_checked_at":"2026-04-08T14:31:17.202Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["foundation-models","robot-learning"],"created_at":"2026-04-08T17:43:19.093Z","updated_at":"2026-04-08T17:43:19.736Z","avatar_url":"https://github.com/zhouzypaul.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AutoEval\n\n[![Paper](https://img.shields.io/badge/arXiv-2503.24278-df2a2a.svg?style=for-the-badge)](https://arxiv.org/abs/2503.24278)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT)\n[![Static 
Badge](https://img.shields.io/badge/Project-Page-a?style=for-the-badge)](https://auto-eval.github.io/)\n\nCode Release for [AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World](https://auto-eval.github.io/assets/paper.pdf). Check out [auto-eval.github.io](https://auto-eval.github.io/) to access the open-access evaluation dashboard on WidowX robots and for instructions of how to get your own policies evaluated by AutoEval. You can host your policy as a server and pass along the IP and port to the dashboard and submit an evaluation job in minutes.\n\nThe [website](https://auto-eval.github.io/) contains all the details on submitting jobs to our Bridge-AutoEval stations with four different tasks. The instructions below are for setting up a new AutoEval station locally for a new task, and hosting a dashboard for policy submission.\n\n![teaser](https://auto-eval.github.io/assets/teaser.png)\n\n## Installations\n\nCreate your conda environment:\n```bash\nconda create -n autoeval python=3.10 -y\nconda activate autoeval\npip install -r requirements.txt\npip install -e .\n```\n\nYou will also need the following dependencies:\n - `manipulator_gym` for the robot environment: https://github.com/rail-berkeley/manipulator_gym\n - `agentlace` for distributed policy and robot environments: https://github.com/youliangtan/agentlace\n - `robot_eval_logger` for logging: https://github.com/zhouzypaul/robot_eval_logger. Please use the `auto_eval` branch instead of `main`.\n\nOther optional packages:\n - `jaxrl_m` (Optional, for jaxrl goal-conditioned policy): https://github.com/rail-berkeley/soar/tree/main/model_training\n - `susie` (Optional, for SuSIE/SOAR policy): https://github.com/kvablack/susie\n - `simpler_env` (Optional, for sim evaluation in SIMPLER): https://github.com/youliangtan/SimplerEnv\n\nWe use a slack bot to send automated messages to a slack channel when human intervention is required in AutoEval. 
To use the slack bot, you need to create a slack app (see [here](https://help.thebotplatform.com/en/articles/7233667-how-to-create-a-slack-bot) for instructions), give it write permission to the channel, and set environment variables:\n```bash\nexport SLACK_BOT_TOKEN=\u003cTOKEN\u003e  # e.g. xoxb-...\nexport SLACK_CHANNEL_ID=\u003cCHANNEL_ID\u003e  # e.g. C06...\n```\nIf you don't want to use the slack bot, you can use the `--no_slack_bot` flag in `run_eval.py`, which creates a dummy bot that prints out messages in the terminal instead of sending them to slack.\n\n## Quick Start\n### Setting Up the Robot Environment\nWe use [manipulator_gym](https://github.com/rail-berkeley/manipulator_gym) and [agentlace](https://github.com/youliangtan/agentlace) to distribute the robot gym-like environment and policy execution (as illustrated below). The robot environment is run on a robot server machine, which can be a lightweight machine (e.g. Intel NUC) that only needs to run ROS and simple python scripts.\n\n```mermaid\ngraph LR\n    A[Robot Driver] \u003c--ROS Topics--\u003e B[Manipulator_gym server]\n    B \u003c--agentlace--\u003e C[Gym Env \u003c-\u003e Policy]\n```\n\n```bash\n# 1. start ros services\nroslaunch interbotix_xsarm_control xsarm_control.launch robot_model:=wx250s use_rviz:=false\n\n# 2. 
start robot server\ncd manipulator_gym\npython3 manipulator_server.py --widowx --cam_ids 0\n```\n\nYou will also need to install the `interbotix_ros_arms` package for the WidowX robot. See [manipulator_gym's descriptions](https://github.com/rail-berkeley/manipulator_gym?tab=readme-ov-file#viperx-or-widowx) for more details.\n\n### Important Code Snippets\nBelow we describe the main evaluation script and the two ways to run policies: (1) locally (where this auto_eval package is run) or (2) remotely with a policy server-client setup.\n- [run_eval.py](run_eval.py): Main script for running evaluations.\n- [scripts/configs/eval_config.py](scripts/configs/eval_config.py): Configuration file for evaluations; contains the task and policy configurations. Add new entries here for setting up new tasks.\n- [auto_eval/robot/policy.py](auto_eval/robot/policy.py): Different robot policies that you can run locally, with no need for a policy server-client setup. Add new classes here for new policies.\n- [auto_eval/robot/policy_clients.py](auto_eval/robot/policy_clients.py): Different policy clients for when a policy is set up as a server remotely instead of run locally with `policy.py`. `OpenWebClient` is a generic policy client that can be used for any policy server that conforms to the AutoEval API.\n- [auto_eval/policy_server/*](auto_eval/policy_server): Pre-made policy servers for some SOTA generalist robot policies. 
Add new servers here for new policies.\n\n### Running a Human Eval\n```bash\n# \u003cROBOT_IP\u003e is the IP address of the robot machine that runs the robot environment\n# make sure to edit `scripts/configs/eval_config.py` to ensure the task is set up correctly and the policy client type is correct.\npython run_eval.py --robot_ip \u003cROBOT_IP\u003e --config scripts/configs/eval_config.py:open_drawer --policy_server_ip \u003cPOLICY_SERVER_IP\u003e --policy_server_port \u003cPOLICY_SERVER_PORT\u003e --human_eval\n```\n\n### Running an Automated Eval\n```bash\n# \u003cROBOT_IP\u003e is the IP address of the robot machine that runs the robot environment\n# make sure to edit `scripts/configs/eval_config.py` to ensure the task is set up correctly and the policy client type is correct.\npython run_eval.py --robot_ip \u003cROBOT_IP\u003e --config scripts/configs/eval_config.py:open_drawer --policy_server_ip \u003cPOLICY_SERVER_IP\u003e --policy_server_port \u003cPOLICY_SERVER_PORT\u003e\n```\nYou can also use the bash scripts under `scripts/launch_*.sh` to run evaluations for the five tasks defined in the paper.\n\n\n## Success Detector\nWe learn a success detector by fine-tuning the Paligemma VLM. We collect images and fine-tune the VLM in the forms of VQA questions (e.g. \"Is the drawer open?\") and train the model to output `yes/no`.\n\nYou must be authenticated to huggingface to use paligemma. To authenticate, check out the top of the page [here](https://huggingface.co/google/paligemma-3b-pt-224).\nThen, run\n```bash\nhuggingface-cli login\n```\n\n1. Collect images by tele-operating the robot. Save all images corresponding to a certain label in a pickle file.\n```bash\n# the default option uses keyboard to control the robot (key bindings will be printed out in the terminal)\n# input keyboard options in the visualizer window, not the terminal\n# you can also use `--use_spacemouse` to tele-operate the robot. 
Tested only with WidowX.\npython scripts/teleop.py --ip \u003cROBOT_IP\u003e --log_type pkl --log_dir ~/datasets/record-open_drawer.pkl\npython scripts/teleop.py --ip \u003cROBOT_IP\u003e --log_type pkl --log_dir ~/datasets/record-close_drawer.pkl\n```\n\n2. Fine-tune Paligemma with the collected images. [scripts/ft_paligemma.py](scripts/ft_paligemma.py) will look for specific file names in the `working_dir`. For example, for `--dataset_type drawer`, it will look for `record-open_drawer.pkl` and `record-close_drawer.pkl`. See [scripts/ft_paligemma.py](scripts/ft_paligemma.py) for details.\n```bash\npython scripts/ft_paligemma.py --working_dir ~/datasets/ --dataset_type drawer\n```\n\n3. Evaluate the fine-tuned Paligemma model.\n```bash\n# evaluate the fine-tuned checkpoint on held-out test set\npython scripts/ft_paligemma.py --working_dir ~/datasets --model_id ~/datasets/checkpoints/... --eval\n\n# teleop the robot and query the model to see where it succeeds/fails\n# you can collect more images on where the classifier fails\npython scripts/teleop.py --ip \u003cROBOT_IP\u003e --pg ~/datasets/checkpoints/...  # use p option in the visualizer window\n```\n\n4. Optional: \"Dagger\" and improve the classifier. In addition to tele-operating the robot and seeing the failure points, you can also run an automated evaluation, collect all the images that are input to the classifier, and manually label them as additional training data.\n```bash\n# run the eval with --save_classifier_data\npython run_eval.py --save_classifier_data\n\n# manually filter and label the images\n# see filter_images.py for details\n# the output files will be saved in `--output_folder/positive.pkl` and `--output_folder/negative.pkl`. Move them to the `working_dir` to train the classifier.\npython scripts/filter_images.py --input_folder ~/auto_eval_log/... 
--output_folder ~/datasets/\n\n# To review the data you have collected and relabel it manually, run\npython scripts/relabel_images.py --input_dir /path/to/dir/with/pickle/files --output_dir /path/to/output\n```\n\n\n## Reset Policy\n### Learned Reset Policy\nTo get a robust reset policy, we collect a small number of demos (about 50) and fine-tune [OpenVLA](https://github.com/openvla/openvla).\n\n1. Collect demonstrations with teleoperation. You can do so easily with keyboard/spacemouse. This will save the demos directly in RLDS format.\n```bash\n# default option is keyboard teleop (key bindings will be printed out in the terminal, use them in the visualizer window)\n# use --use_spacemouse to teleoperate with spacemouse. Tested only with WidowX.\npython scripts/teleop.py --ip \u003cROBOT_IP\u003e --log_dir ~/datasets/drawer-scene-demos --log_lang_text \"open the drawer\"\n```\n\nYou can also collect demonstrations with a VR headset as described by the [BridgeData V2 paper](https://github.com/rail-berkeley/bridge_data_robot?tab=readme-ov-file#data-collection). The default data collection code will save the demos in a raw format, and you will need to convert them to RLDS format with [dlimp](https://github.com/zhouzypaul/dlimp) to make them readable with the OpenVLA dataloader. In dlimp, set `TRAIN_PROPORTION=0.99` and `DEPTH=2`, and make sure to manually override the language instructions of these demos.\n```bash\ncd dlimp/rlds_converters/bridge_dataset\nCUDA_VISIBLE_DEVICES=\"\" tfds build --manual_dir ~/datasets/drawer-scene-demos\n```\n\n2. 
Fine-tune OpenVLA via LoRA\nMake the following file structure:\n```bash\n~/checkpoints/auto-eval-openvla-drawer\n |_ checkpoints             # full merged model checkpoints\n |_ adapter_checkpoints     # adapter checkpoints\n |_ bridge_orig\n    |_ 1.0.0\n       |_ dataset_info.json\n       |_ features.json\n       |_ expert_demos-train.tfrecord....\n```\n\nMove the dataset to this new directory:\n```bash\nmv ~/tensorflow_datasets/bridge_dataset/ ~/checkpoints/auto-eval-openvla-drawer/bridge_orig\n```\nWe will treat these expert demos as the `bridge_orig` dataset, so we don't need to register the new dataset in the OpenVLA repo.\n\nTo start training on a single node:\n```bash\ntorchrun \\\n  --standalone \\\n  --nnodes 1 \\\n  --nproc-per-node 1 \\\n  scripts/ft_openvla.py \\\n  --batch_size 32 \\\n  --shuffle_buffer_size 1000 \\\n  --lora_rank 64 \\\n  --data_root_dir ~/checkpoints/auto-eval-openvla-drawer \\\n  --dataset_name bridge_orig \\\n  --run_root_dir ~/checkpoints/auto-eval-openvla-drawer/checkpoints \\\n  --adapter_tmp_dir ~/checkpoints/auto-eval-openvla-drawer/adapter_checkpoints \\\n  --use_quantization true \\\n  --save_steps 1000 \\\n  --max_steps 3000 \\\n  --wandb_project auto-eval-openvla-ft \\\n  --wandb_entity \u003cWANDB_ENTITY\u003e\n```\n\n3. 
Evaluate the fine-tuned policy\n```python\n# Option 1: Use the base OpenVLA model and pass in the LoRA adapters and the new dataset statistics json.\n# this will load the base OpenVLA model and merge in the local LoRA adapter with peft\nfrom auto_eval.robot.policy import OpenVLAPolicy\npolicy = OpenVLAPolicy(\n    lora_adapter_dir=\"~/checkpoints/auto-eval-openvla-drawer/adapter_checkpoints\",\n    dataset_stats_path=\"~/checkpoints/auto-eval-openvla-drawer/bridge_orig/1.0.0/dataset_info.json\",\n)\n```\n\n```bash\n# Option 2: Host an OpenVLA server with the merged model weights under `checkpoints`.\n# this will load the merged model weights from the `checkpoints` directory\npython auto_eval/policy_server/openvla_server.py --openvla_path ~/checkpoints/auto-eval-openvla-drawer/checkpoints\n```\n\nYou can run the evaluation with `run_eval.py`.\n\n\n### Scripted Policy\nFor some more structured environments, we also support using scripted policies as the reset policy. To script a policy, we record a tele-operated demonstration of the policy and replay it for resetting the environment.\n\nTo record a tele-operated demonstration, you can use the `teleop.py` script:\n```bash\npython scripts/teleop.py --ip \u003cROBOT_IP\u003e --log_type pkl --log_actions_only --log_dir scripted_policy.pkl\n```\nThen, use `auto_eval/robot/policy.py:RecordedPolicy` to replay the demonstration:\n```python\nfrom auto_eval.robot.policy import RecordedPolicy\npolicy = RecordedPolicy(\n    policy_save_path=\"scripted_policy.pkl\"\n)\n```\n\n## Running Policies Locally \u0026 Hosting Policy Servers\nIn the officially hosted [AutoEval](https://auto-eval.github.io), we use the server-client setup to evaluate policies: users must host their policy as remote servers, and AutoEval will connect to these servers with `OpenWebClient` to retrieve policy outputs.\n\nWhen setting up a new AutoEval station, you have two options of running policies:\n1. 
Run the policies locally (on the same machine where you run [run_eval.py](run_eval.py))\n2. Run policies remotely (on a different machine) as a server, and connect to it with a policy client in [run_eval.py](run_eval.py). This is recommended for resource-intensive policies.\n\n### Running Policies Locally\n[auto_eval/robot/policy.py](auto_eval/robot/policy.py) contains different policies that you can run locally. To use a policy, just import the policy class and pass in the required arguments. For example:\n```python\nfrom auto_eval.robot.policy import policies\npolicy = policies[\"openvla\"](\n  config={\n    \"lora_adapter_dir\": \"~/checkpoints/auto-eval-openvla-drawer/adapter_checkpoints\",\n    \"dataset_stats_path\": \"~/checkpoints/auto-eval-openvla-drawer/bridge_orig/1.0.0/dataset_info.json\",\n  }\n)\n```\n[run_eval.py](run_eval.py) and [scripts/configs/eval_config.py](scripts/configs/eval_config.py) also provide examples of using local policies.\nTo run your own policy, add additional classes to [auto_eval/robot/policy.py](auto_eval/robot/policy.py).\n\n### Running Policies Remotely\nThe policy server is a REST API server that accepts POST requests (containing observation images, language instructions, and proprio states) and returns the 7-dim policy actions.\nThere are some example servers in [auto_eval/policy_server/*](auto_eval/policy_server/*). On the remote machine, start the server with:\n```bash\n# for example, to start the OpenVLA server\ncd auto_eval/policy_server/openvla_server\npython3 openvla_server.py\n```\n\nTo build your own policy server, follow the example in [auto_eval/policy_server/template.py](auto_eval/policy_server/template.py).\nYou can also build a stateful server (e.g. 
one that keeps track of observation history or action chunks), see [auto_eval/policy_server/template_advanced.py](auto_eval/policy_server/template_advanced.py) for an example.\n\nTo connect to the policy server, you need to use a policy client in the AutoEval code:\n```python\nfrom auto_eval.robot.policy_clients import OpenWebClient\nclient = OpenWebClient(\n    policy_server_ip=...,\n    policy_server_port=...,\n)\n```\nMake sure that the machine running the AutoEval code can access the IP and port of the policy server (e.g. by ssh port forwarding or making the policy server public).\n\n\n## Web UI for Job Submission\n\nWe implement a job submission web UI (see official site [here](https://auto-eval.github.io)) with FastAPI in [index.html](static/index.html) and [job_scheduler.py](job_scheduler.py). The UI includes a job submission and status page, and a web viewer for live robot activities.\n\nTo start the server locally:\n```bash\nuvicorn job_scheduler:app --reload --host 0.0.0.0 --port 8080\n```\nThe web UI is available at http://localhost:8080/page.\n\n### Taking Robots \"Offline\"\nWe also add functionality to take robots \"offline\" (e.g. to prevent them from accepting new jobs) for maintenance or other purposes. 
Use [auto_eval/web_ui/robot_control.py](auto_eval/web_ui/robot_control.py) to take robots offline and bring them back online.\n```bash\n# View status of all robots\npython auto_eval/web_ui/robot_control.py status\n\n# Take a robot offline with a custom message\npython auto_eval/web_ui/robot_control.py offline widowx_drawer --message \"Under maintenance until tomorrow\"\npython auto_eval/web_ui/robot_control.py offline widowx_sink --message \"Hardware issue\"\n\n# Take all robots offline at once\npython auto_eval/web_ui/robot_control.py offline all --message \"System maintenance\"\n\n# Bring a robot back online\npython auto_eval/web_ui/robot_control.py online widowx_drawer\npython auto_eval/web_ui/robot_control.py online widowx_sink\n```\n\n\n## Eval with Simpler Env\n\nHere we provide `egg-plant-sink` and `drawer` SimplerEnv scenes that match the scenes in our custom auto eval. To run the SimplerEnv example, run the following:\n\nNOTE: this uses a custom fork: https://github.com/youliangtan/SimplerEnv\n\n```bash\n# Test the simplerenv scenes\npython scripts/simpler_eval/eval_simpler.py --test --env widowx_open_drawer\npython scripts/simpler_eval/eval_simpler.py --test --env widowx_close_drawer\npython scripts/simpler_eval/eval_simpler.py --test --env widowx_put_eggplant_in_basket\npython scripts/simpler_eval/eval_simpler.py --test --env widowx_put_eggplant_in_sink\n\n# Openvla policy\npython scripts/simpler_eval/eval_simpler.py --env widowx_open_drawer --openvla --server_host localhost\n\n# octo policy\npython scripts/simpler_eval/eval_simpler.py --env widowx_open_drawer --octo\n\n# gcbc policy\npython scripts/simpler_eval/eval_simpler.py --env widowx_open_drawer --gcbc\n\n# susie policy\npython scripts/simpler_eval/eval_simpler.py --env widowx_open_drawer --susie --server_host localhost\n```\n\nChange the `--env` argument to run on different tasks.\n\n## Safety\n\n[manipulator_gym](https://github.com/rail-berkeley/manipulator_gym) provides a set of safety 
gym wrappers that can be used for extended robot operation on the WidowX robot:\n```python\nfrom manipulator_gym.utils.gym_wrappers import (\n    CheckAndRebootJoints,\n    ClipActionBoxBoundary,\n    InHouseImpedanceControl,\n    LimitMotorMaxEffort,\n)\n```\n\nTo set up the robot safety boundary, you can use the `--track_workspace_bounds` option in `scripts/teleop.py`. Then, teleoperate the robot to the maximum allowed robot workspace, and the maximum xyz coordinates will be recorded and printed out. Then, use the `ClipActionBoxBoundary` wrapper to clip the actions to the safety boundary.\n\n## Contributing\nTo enable code checks and auto-formatting, please install pre-commit hooks (run this in the root directory):\n```bash\npre-commit install\n\n# To run the checks manually\npre-commit run --all-files\n```\nThe hooks should now run before every commit. If files are modified during the checks, you'll need to re-stage them and commit again.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhouzypaul%2Fauto_eval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzhouzypaul%2Fauto_eval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhouzypaul%2Fauto_eval/lists"}