{"id":50510427,"url":"https://github.com/Tencent-Hunyuan/HY-Embodied","last_synced_at":"2026-06-19T14:00:39.441Z","repository":{"id":350137634,"uuid":"1202005530","full_name":"Tencent-Hunyuan/HY-Embodied","owner":"Tencent-Hunyuan","description":"HY-Embodied: Embodied Foundation Models for Real-World Agents","archived":false,"fork":false,"pushed_at":"2026-06-18T07:15:47.000Z","size":11715,"stargazers_count":749,"open_issues_count":12,"forks_count":14,"subscribers_count":4,"default_branch":"master","last_synced_at":"2026-06-18T09:13:42.895Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Tencent-Hunyuan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"License.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-05T13:20:23.000Z","updated_at":"2026-06-18T07:15:50.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Tencent-Hunyuan/HY-Embodied","commit_stats":null,"previous_names":["tencent-hunyuan/hy-embodied"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Tencent-Hunyuan/HY-Embodied","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tencent-Hunyuan%2FHY-Embodied","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tencent-Hunyuan%2FHY-Embodied/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tencent-Hunyuan%2FHY-Embodied/releases","manifests_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tencent-Hunyuan%2FHY-Embodied/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Tencent-Hunyuan","download_url":"https://codeload.github.com/Tencent-Hunyuan/HY-Embodied/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tencent-Hunyuan%2FHY-Embodied/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34534278,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-19T02:00:06.005Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-02T20:00:26.252Z","updated_at":"2026-06-19T14:00:39.436Z","avatar_url":"https://github.com/Tencent-Hunyuan.png","language":"Python","funding_links":[],"categories":["🏭 Industrial / Production Model Reports"],"sub_categories":["🔁 Iterative Self-Bootstrapping"],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003eHY-Embodied\u003c/h1\u003e\n\u003cp\u003e\u003cb\u003eA Family of Embodied Foundation Models for Real-World Agents\u003c/b\u003e\u003c/p\u003e\n\u003cp\u003e\u003ci\u003eTencent Robotics X × HY Vision Team\u003c/i\u003e\u003c/p\u003e\n\n\u003ca href=\"hy_embodied_tech_report.pdf\"\u003e\u003cimg src=\"https://img.shields.io/badge/PDF-Report-green?logo=report\" alt=\"Tech Report\"\u003e\u003c/a\u003e\n\u003ca 
href=\"https://arxiv.org/abs/2604.07430\"\u003e\u003cimg src=\"https://img.shields.io/badge/Paper-arXiv-red?logo=arxiv\" alt=\"arXiv\"\u003e\u003c/a\u003e\n\u003ca href=\"https://huggingface.co/tencent/HY-Embodied-0.5/tree/main\"\u003e\u003cimg src=\"https://img.shields.io/badge/Models-HuggingFace-yellow?logo=huggingface\" alt=\"Models\"\u003e\u003c/a\u003e\n\u003ca href=\"https://x.com/TencentHunyuan/status/2042503238877135336?s=20\"\u003e\u003cimg src=\"https://img.shields.io/badge/Post-X-black?logo=x\u0026logoColor=white\" alt=\"X\"\u003e\u003c/a\u003e\n\n\u003c/div\u003e\n\nhttps://github.com/user-attachments/assets/a5c6b872-2cb0-4f52-8321-894fee7da27e\n\n## 🔥 Updates\n\n  * **`[2026-06-15]`** 🤖 We have released **HY-VLA-0.5**! The [official code](https://github.com/Tencent-Hunyuan/Hy-Embodied-0.5-VLA), UMI-trained [weights](https://huggingface.co/tencent/Hy-Embodied-0.5-VLA-UMI), and 2000+ hours of high-fidelity UMI [data](https://huggingface.co/datasets/tencent/Hy-Embodied-0.5-VLA-Data) are now available.\n\n  * **`[2026-04-09]`** 🚀 We have released **HY-Embodied-0.5**, featuring the open-source `HY-Embodied-0.5 MoT-2B` weights on [Hugging Face](https://huggingface.co/tencent/HY-Embodied-0.5/tree/main) along with the official inference code!\n\n## 📖 Abstract\n\nWe introduce **HY-Embodied-0.5**, a suite of foundation models tailored specifically for real-world embodied intelligence. To bridge the gap between general Vision-Language Models (VLMs) and the strict demands of physical agents, our models are engineered to excel in spatio-temporal visual perception and complex embodied reasoning (prediction, interaction, and planning).\n\nThe suite features an innovative **Mixture-of-Transformers (MoT)** architecture that uses latent tokens for modality-specific computing, significantly enhancing fine-grained perception. It includes two primary variants: a highly efficient **2B model** for edge deployment and a powerful **32B model** for complex reasoning. 
Through a self-evolving post-training paradigm and large-to-small on-policy distillation, our compact MoT-2B outperforms state-of-the-art models of similar size across 16 benchmarks, while the 32B variant achieves frontier-level performance comparable to Gemini 3.0 Pro. Ultimately, HY-Embodied serves as a robust \"brain\" for Vision-Language-Action (VLA) pipelines, delivering compelling results in real-world physical robot control.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figures/teaser.png\" alt=\"HY-Embodied Teaser\" width=\"85%\"\u003e\n\u003c/div\u003e\n\n## ⭐️ Key Features\n\n  * 🧠 **Evolved MoT Architecture:** Designed for maximum efficiency without sacrificing visual acuity. The MoT-2B variant contains 4B total parameters but requires **only 2.2B activated parameters** during inference. By emphasizing modality-specific computing in the vision pathway, it achieves the high inference speed of a dense 2B model while delivering superior, fine-grained perceptual representations.\n  * 🔗 **High-Quality Mixed Chain Reasoning:** We introduce an advanced iterative, self-evolving post-training pipeline. By employing on-policy distillation, we successfully transfer the sophisticated step-by-step reasoning, planning, and high-quality \"thinking\" capabilities from our powerful 32B model directly to the compact 2B variant.\n  * 🌍 **Large-Scale Embodied Pre-training:** Grounded in a massive, specially curated dataset comprising **\\\u003e100 million** embodied and spatial-specific data points. Trained on a corpus exceeding **200 billion tokens**, the model develops a deep, native understanding of 3D spaces, physical object interactions, and agent dynamics.\n  * 🦾 **Stronger VLA Application:** Beyond standard academic benchmarks, HY-Embodied is engineered to be the core cognitive engine for physical robots. 
It seamlessly integrates into Vision-Language-Action (VLA) frameworks, acting as a highly robust and capable brain to drive high success rates in complex, real-world robotic control tasks.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figures/arch.png\" alt=\"HY-Embodied Architecture\" width=\"85%\"\u003e\n\u003c/div\u003e\n\n## 📅 Roadmap\n\n- [x] Transformers Inference\n- [x] Fine-tuning Code\n- [ ] vLLM Inference\n- [ ] Online Gradio Demo\n\n## 🛠️ Dependencies and Installation\n\n### Prerequisites\n\n- 🖥️ **Operating System**: Linux (recommended)\n- 🐍 **Python**: 3.12+ (recommended and tested)\n- ⚡ **CUDA**: 12.6\n- 🔥 **PyTorch**: 2.8.0\n- 🎮 **GPU**: NVIDIA GPU with CUDA support\n\n### Installation\n\n1. **Install the specific Transformers version required for this model:**\n```bash\npip install git+https://github.com/huggingface/transformers@9293856c419762ebf98fbe2bd9440f9ce7069f1a\n```\n\n\u003e **Note**: We will merge the improvements into the Transformers main branch later.\n\n2. **Install other dependencies (run from the root of the cloned repository):**\n```bash\npip install -r requirements.txt\n```\n\n### Quick Start\n\n1. **Clone the repository:**\n```bash\ngit clone https://github.com/Tencent-Hunyuan/HY-Embodied\ncd HY-Embodied/\n```\n\n2. **Install dependencies (skip if already installed above):**\n```bash\npip install -r requirements.txt\n```\n\n3. **Run inference:**\n```bash\npython inference.py\n```\n\nThe example script demonstrates both single-sample and batch generation.\n\n### Model Download\n\nThe code automatically downloads the model `tencent/HY-Embodied-0.5` from Hugging Face Hub. 
Ensure you have sufficient disk space (8 GB) for the model weights.\n\n### Hardware Requirements\n\n- **GPU**: Recommended for optimal performance (NVIDIA GPU with at least 16GB VRAM)\n- **CPU**: Supported but slower\n- **Memory**: At least 16GB RAM recommended\n- **Storage**: 20GB+ free space for model and dependencies\n\n### Coordinate \u0026 Response Format\n\nThe model uses the following coordinate representations:\n\n- **Point**: `\u003cpoint\u003e(x, y)\u003c/point\u003e`, or a list of points `[\u003cpoint\u003e(x1, y1)\u003c/point\u003e, \u003cpoint\u003e(x2, y2)\u003c/point\u003e]`\n- **Box**: `\u003cbox\u003e[xmin, ymin, xmax, ymax]\u003c/box\u003e` or a list of boxes\n\nAll coordinates are normalized to integers in the range **(0, 1000)**.\n\nThe model's response follows a structured thinking format:\n\n```\n\u003cthink\u003e\\n[thinking content]\\n\u003c/think\u003e\\n\u003canswer\u003e\\n[answer content]\\n\u003c/answer\u003e\n```\n\n## 🚀 Quick Start with Transformers\n\n### Basic Inference Example\n\n```python\nimport os\nimport torch\nfrom transformers import AutoModelForImageTextToText, AutoProcessor\n\n# Load model \u0026 processor\nMODEL_PATH = \"tencent/HY-Embodied-0.5\"\nDEVICE = \"cuda\"\n\n# True enables thinking mode; set to False for non-thinking mode\nTHINKING_MODE = True\n\nTEMPERATURE = 0.05\n\nprocessor = AutoProcessor.from_pretrained(MODEL_PATH)\n\n# Load chat template if available (only found when MODEL_PATH is a local directory)\nchat_template_path = os.path.join(MODEL_PATH, \"chat_template.jinja\")\nif os.path.exists(chat_template_path):\n    with open(chat_template_path, encoding=\"utf-8\") as f:\n        processor.chat_template = f.read()\n\nmodel = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)\nmodel.to(DEVICE).eval()\n\n# Prepare input messages\nmessages = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"image\", \"image\": \"./figures/example.jpg\"},\n            {\"type\": \"text\", \"text\": \"Describe the image in detail.\"},\n        ],\n    
}\n]\n\n# Process and generate\ninputs = processor.apply_chat_template(\n    messages,\n    tokenize=True,\n    add_generation_prompt=True,\n    return_dict=True,\n    return_tensors=\"pt\",\n    enable_thinking=THINKING_MODE,\n).to(model.device)\n\nwith torch.no_grad():\n    generated_ids = model.generate(\n        **inputs,\n        max_new_tokens=32768,\n        use_cache=True,\n        temperature=TEMPERATURE,\n        do_sample=TEMPERATURE \u003e 0,\n    )\n\n# Strip the prompt tokens so that only the newly generated text is decoded\noutput_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]\nprint(processor.batch_decode(output_ids, skip_special_tokens=True)[0])\n```\n\n### Batch Inference\n\n```python\nimport os\nimport torch\nfrom transformers import AutoModelForImageTextToText, AutoProcessor\n\n# Load model \u0026 processor\nMODEL_PATH = \"tencent/HY-Embodied-0.5\"\nDEVICE = \"cuda\"\n\n# True enables thinking mode; set to False for non-thinking mode\nTHINKING_MODE = True\n\nTEMPERATURE = 0.5\n\nprocessor = AutoProcessor.from_pretrained(MODEL_PATH)\n\n# Load chat template if available (only found when MODEL_PATH is a local directory)\nchat_template_path = os.path.join(MODEL_PATH, \"chat_template.jinja\")\nif os.path.exists(chat_template_path):\n    with open(chat_template_path, encoding=\"utf-8\") as f:\n        processor.chat_template = f.read()\n\nmodel = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)\nmodel.to(DEVICE).eval()\n\n# Batch Inference (multiple prompts at once)\nmessages_batch = [\n    # Sample A: image + text\n    [\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"image\", \"image\": \"./figures/example.jpg\"},\n                {\"type\": \"text\", \"text\": \"Describe the image in detail.\"},\n            ],\n        }\n    ],\n    # Sample B: text only\n    [\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": \"How to open a fridge?\"},\n            ],\n        }\n    ],\n]\n\n# Process each message independently\nall_inputs = 
[]\nfor msgs in messages_batch:\n    inp = processor.apply_chat_template(\n        msgs,\n        tokenize=True,\n        add_generation_prompt=True,\n        return_dict=True,\n        return_tensors=\"pt\",\n        enable_thinking=THINKING_MODE,\n    )\n    all_inputs.append(inp)\n\n# Left-pad and batch\nbatch = processor.pad(all_inputs, padding=True, padding_side=\"left\").to(model.device)\n\nwith torch.no_grad():\n    batch_generated_ids = model.generate(\n        **batch,\n        max_new_tokens=32768,\n        use_cache=True,\n        temperature=TEMPERATURE,\n        do_sample=TEMPERATURE \u003e 0,\n    )\n\n# Decode: strip the padded input portion\npadded_input_len = batch[\"input_ids\"].shape[1]\nfor i, msgs in enumerate(messages_batch):\n    out_ids = batch_generated_ids[i][padded_input_len:]\n    print(f\"\\n--- Sample {i} ---\")\n    print(processor.decode(out_ids, skip_special_tokens=True))\n```\n\n## 📊 Evaluation\n\n### Visual Perception\n\n\u003e **Note**: We evaluated HY-Embodied-0.5 MoT-2B across 22 embodied-relevant benchmarks against models of similar size. 
For detailed performance metrics and methodology, please refer to our technical report.\n\n| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |\n|-----------|------------------------|-------------|-------------|------------------|------------------|\n| CV-Bench  | **89.2** | 80.0 | 85.7 | 86.9 | 88.8 |\n| DA-2K     | **92.3** | 69.5 | 76.5 | 79.4 | 72.2 |\n\n### Embodied Understanding\n\n| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |\n|-----------|------------------------|-------------|-------------|------------------|------------------|\n| ERQA | **54.5** | 41.8 | 47.3 | 43.3 | 46.8 |\n| EmbSpatial-Bench | **82.8** | 75.9 | 80.7 | 73.8 | 76.2 |\n| RoboBench-MCQ | **49.2** | 36.9 | 45.8 | 44.4 | 43.6 |\n| RoboBench-Planning | 54.2 | 36.2 | 36.4 | 39.2 | **58.7** |\n| RoboSpatial-Home | 55.7 | 45.3 | **63.2** | 62.3 | 61.8 |\n| ShareRobot-Aff. | **26.8** | 19.8 | 25.5 | 25.5 | 9.0 |\n| ShareRobot-Traj. 
| 73.3 | 41.6 | 62.2 | **81.4** | 50.6 |\n| Ego-Plan2 | 45.5 | 35.5 | 38.8 | **52.6** | 39.9 |\n\n### Spatial Understanding\n\n| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |\n|-----------|------------------------|-------------|-------------|------------------|------------------|\n| 3DSRBench | **57.0** | 39.9 | 43.9 | 44.8 | 42.0 |\n| All-Angles Bench | **55.1** | 42.3 | 46.7 | 43.8 | 49.0 |\n| MindCube | **66.3** | 28.4 | 31.0 | 26.9 | 36.2 |\n| MMSI-Bench | **33.2** | 23.6 | 25.1 | 20.5 | 31.9 |\n| RefSpatial-Bench | 45.8 | 28.9 | 45.3 | **56.0** | 48.0 |\n| SAT | 76.7 | 45.3 | 56.7 | 51.3 | **78.7** |\n| SIBench-mini | **58.2** | 42.0 | 50.9 | 47.3 | 53.1 |\n| SITE-Bench-Image | **62.7** | 52.3 | 61.0 | 57.9 | 49.9 |\n| SITE-Bench-Video | **63.5** | 52.2 | 58.0 | 54.8 | 58.9 |\n| ViewSpatial | **53.1** | 37.2 | 41.6 | 36.6 | 36.1 |\n| VSIBench | **60.5** | 48.0 | 55.2 | 41.7 | 48.5 |\n| Where2Place | **68.0** | 45.0 | 59.0 | 65.0 | 63.6 |\n\n*Note: Results for HY-Embodied-0.5 MoT-2B are reported in thinking mode; for all other models, we report the better of the non-thinking and thinking results.*\n\n## 📚 Citation\n\nIf you find this work useful for your research or applications, please cite our paper using this BibTeX:\n```bibtex\n@article{tencent2026hyembodied05,\ntitle={HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents},\nauthor={Team, HY and Yu, Xumin and Liu, Zuyan and Wang, Ziyi and Zhang, He and Rao, Yongming and Liu, Fangfu and Zhang, Yani and Zhao, Ruowen and Wang, Oran and others},\njournal={arXiv preprint arXiv:2604.07430},\nyear={2026}\n}\n```\n\n## 🙏 Acknowledgements\n\nWe thank the Hugging Face community for their support and the open-source contributions that made this implementation 
possible.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTencent-Hunyuan%2FHY-Embodied","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTencent-Hunyuan%2FHY-Embodied","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTencent-Hunyuan%2FHY-Embodied/lists"}