{"id":30527975,"url":"https://github.com/ByteDance-Seed/m3-agent","last_synced_at":"2025-08-27T04:03:08.391Z","repository":{"id":309566401,"uuid":"1029059614","full_name":"ByteDance-Seed/m3-agent","owner":"ByteDance-Seed","description":null,"archived":false,"fork":false,"pushed_at":"2025-08-20T02:44:30.000Z","size":6527,"stargazers_count":438,"open_issues_count":3,"forks_count":35,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-08-20T04:43:39.733Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ByteDance-Seed.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-30T13:12:32.000Z","updated_at":"2025-08-20T04:41:17.000Z","dependencies_parsed_at":"2025-08-12T16:35:22.143Z","dependency_job_id":"835eef9c-c7cc-44db-8636-daf44b8c75f1","html_url":"https://github.com/ByteDance-Seed/m3-agent","commit_stats":null,"previous_names":["bytedance-seed/m3-agent"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ByteDance-Seed/m3-agent","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2Fm3-agent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2Fm3-agent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2Fm3-agent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2Fm3-agent/manifests","owner_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/owners/ByteDance-Seed","download_url":"https://codeload.github.com/ByteDance-Seed/m3-agent/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2Fm3-agent/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272288927,"owners_count":24907776,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-27T02:00:09.397Z","response_time":76,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-27T04:01:51.630Z","updated_at":"2025-08-27T04:03:08.359Z","avatar_url":"https://github.com/ByteDance-Seed.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cdiv align=left\u003e\n    \u003cimg src=\"https://github.com/user-attachments/assets/c42e675e-497c-4508-8bb9-093ad4d1f216\" width=40%\u003e\n\u003c/div\u003e\n\n\u003ch1 style=\"text-align: center;\"\u003eSeeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term 
Memory\u003c/h1\u003e\n\n[![arXiv](https://img.shields.io/badge/arXiv-2508.09736-b31b1b.svg)](https://arxiv.org/abs/2508.09736)\n[![Demo](https://img.shields.io/badge/homepage-M3--Agent-blue)](https://m3-agent.github.io)\n[![Model](https://img.shields.io/badge/model_HF-Memorization-green)](https://huggingface.co/ByteDance-Seed/M3-Agent-Memorization)\n[![Model](https://img.shields.io/badge/model_HF-Control-darkgreen)](https://huggingface.co/ByteDance-Seed/M3-Agent-Control)\n[![Data](https://img.shields.io/badge/data-M3--Bench-F9D371)](https://huggingface.co/datasets/ByteDance-Seed/M3-Bench)\n\n## Abstract\n\nWe introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot’s perspective (M3-Bench-robot) and 920 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. 
Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 8.2%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances the multimodal agents toward more human-like long-term memory and provides insights into their practical design.\n\n![illustration](figs/illustration.png)\n\n## A demo of M3-Agent as a personal assistant!\n\n[![Watch the video](figs/demo.png)](https://www.youtube.com/watch?v=XUx31cBanfo)\n\nThe video can also be accessed on [Bilibili](https://www.bilibili.com/video/BV1h9YpznEx9/).\n\n## M3-Bench\n\nWe introduce M3-Bench, a long-video question answering dataset designed to evaluate the capability of multimodal agents to perform reasoning over long-term memory. Each instance in M3-Bench comprises a long video simulating the perceptual input of an agent, along with a series of open-ended question-answer pairs. The dataset is organized into two subsets:\n1. M3-Bench-robot, which contains 100 real-world videos recorded from a robot's first-person perspective.\n2. M3-Bench-web, which includes 920 web-sourced videos covering a wider variety of content and scenarios.\n\n![architecture](figs/m3-bench-example.png)\\\n[link1](https://www.youtube.com/watch?v=7W0gRqCRMZQ), [link2](https://www.youtube.com/watch?v=Efk3K4epEzg), [link3](https://www.youtube.com/watch?v=6Unxpxy-Ct4)\\\nExamples from M3-Bench. M3-Bench-robot features long videos from realistic robotic work scenarios, while M3-Bench-web expands the video diversity to support broader evaluation. The question-answering tasks are designed to assess a multimodal agent’s ability to construct consistent and reliable long-term memory, as well as to reason effectively over that memory.\n\n![architecture](figs/m3-bench-statistic.png)\n\nStatistical overview of the M3-Bench benchmark. 
Each question may correspond to multiple question types.\n\n### Videos\n\n1. Download M3-Bench-robot from [huggingface](https://huggingface.co/datasets/ByteDance-Seed/M3-Bench/tree/main/videos/robot)\n2. Download the M3-Bench-web videos from the `video_url` entries in `data/annotations/web.json`\n\n### Intermediate Outputs\n\n**[optional]** You can either download the intermediate outputs we have processed from [huggingface](https://huggingface.co/datasets/ByteDance-Seed/M3-Bench/tree/main/intermediate_outputs) or generate them directly from the videos by following the steps below.\n\n### Memory Graphs\n\n**[optional]** You can either download and extract the memory graphs we have processed from [huggingface](https://huggingface.co/datasets/ByteDance-Seed/M3-Bench/tree/main/memory_graphs) or generate them directly from the videos by following the steps below.\n\n## M3-Agent\n\n![architecture](figs/m3-agent.png)\n\nArchitecture of M3-Agent. The system consists of two parallel processes: memorization and control. During memorization, M3-Agent processes video and audio streams online to generate episodic and semantic memory. During control, it executes instructions by iteratively thinking and retrieving from long-term memory. The long-term memory is structured as a multimodal graph.\n\n## Experimental Results\n\n![architecture](figs/exp_result.png)\n\nResults on M3-Bench-robot, M3-Bench-web, and VideoMME-long.\n\n## Run Locally\n\n\u003e Before running, add your API configuration to `configs/api_config.json`.\n\n### Memorization\n\nGenerate memory graphs for each video. The results are saved in `data/memory_graphs`.\n\n- The following steps are required only if you haven't downloaded *intermediate_outputs* and *memory_graphs* from huggingface, or if you want to process videos that are not part of M3-Bench.\n\n1. Set up environment\n\n```bash\nbash setup.sh\npip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8\npip install qwen-omni-utils==0.0.4\n```\n\n2. 
Cut Video\n\n   Cut the video into 30-second segments.\n\n```bash\n#!/bin/bash\n\n# Video to split, relative to data/videos (without the .mp4 extension)\nvideo=\"robot/bedroom_01\"\ninput=\"data/videos/$video.mp4\"\nmkdir -p \"data/clips/$video\"\n\n# Total duration in whole seconds, read with ffprobe\nduration=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 \"$input\")\nduration_seconds=$(echo \"$duration\" | awk '{print int($1)}')\n\n# Write consecutive 30-second clips named 0.mp4, 1.mp4, ...\nsegments=$((duration_seconds / 30 + 1))\nfor ((i=0; i\u003csegments; i++)); do\n    start=$((i * 30))\n    output=\"data/clips/$video/$i.mp4\"\n    ffmpeg -ss \"$start\" -i \"$input\" -t 30 -c copy \"${output}\"\ndone\n```\n\n3. Prepare data\n\nPrepare a JSONL file with one video per line and save it as `data/data.jsonl`. The `clip_path` and `mem_path` values below follow the clip directory written in step 2 and the `data/memory_graphs` output location.\n\n```json\n{\"id\": \"bedroom_01\", \"video_path\": \"data/videos/robot/bedroom_01.mp4\", \"clip_path\": \"data/clips/robot/bedroom_01\", \"mem_path\": \"data/memory_graphs/robot/bedroom_01.pkl\", \"intermediate_path\": \"data/videos/intermediate_outputs/robot/bedroom_01\"}\n```\n\n\n4. Generate Intermediate Outputs\n\n   **This step uses Face Detection and Speaker Diarization tools to generate intermediate outputs.**\n\n   - If you want to use M3-Bench and have downloaded intermediate_outputs from huggingface, you can skip this step.\n\n   - Download the audio embedding model [pretrained_eres2netv2.ckpt](https://www.modelscope.cn/models/iic/speech_eres2netv2_sv_zh-cn_16k-common/resolve/master/pretrained_eres2netv2.ckpt) and save it in `models/`\n\n   - Download [speakerlab](https://github.com/modelscope/3D-Speaker/tree/main/speakerlab) and place it in the repository root\n\n   ```\n   m3-agent\n   ├── models\n   │   └── pretrained_eres2netv2.ckpt\n   └── speakerlab\n   ```\n\n```bash\npython m3_agent/memorization_intermediate_outputs.py \\\n   --data_file data/data.jsonl\n```\n\n5. 
Generate Memory Graphs\n\n   **This step uses the M3-Agent-Memorization model to generate memory graphs.**\n\n   - Download M3-Agent-Memorization from [huggingface](https://huggingface.co/ByteDance-Seed/M3-Agent-Memorization)\n\n```bash\npython m3_agent/memorization_memory_graphs.py \\\n   --data_file data/data.jsonl\n```\n\n6. Memory Graph Visualization\n\n```bash\npython visualization.py \\\n   --mem_path data/memory_graphs/robot/bedroom_01.pkl \\\n   --clip_id 1\n```\n\n### Control\n\n1. Set up environment\n\n```bash\nbash setup.sh\npip install transformers==4.51.0\npip install vllm==0.8.4\npip install numpy==1.26.4\n```\n\n2. Question Answering and Evaluation\n\n   **This step uses the M3-Agent-Control model to generate answers and GPT-4o to evaluate them.**\n\n   - Download M3-Agent-Control from [huggingface](https://huggingface.co/ByteDance-Seed/M3-Agent-Control)\n\n```bash\npython m3_agent/control.py \\\n   --data_file data/annotations/robot.json\n```\n\n### Other Models\n\nTo prompt other models to generate memory or answer questions, replace the local model inference with API calls and use the corresponding prompt.\n\nPrompts:\n\n1. Memorization\n   - Gemini/GPT-4o: `mmagent.prompts.prompt_generate_captions_with_ids`\n   - Qwen2.5-Omni-7B: `mmagent.prompts.prompt_generate_full_memory`\n\n2. Control\n   - GPT-4o: `mmagent.prompts.prompt_answer_with_retrieval_final`\n\n\n## Training\n\n1. Memorization: https://github.com/hyc2026/sft-qwen2.5-omni-thinker\n2. 
Control: https://github.com/hyc2026/M3-Agent-Training\n\n## Citation\nPlease cite us as:\n\n```BibTeX\n@misc{long2025seeing,\n      title={Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory},\n      author={Lin Long and Yichen He and Wentao Ye and Yiyuan Pan and Yuan Lin and Hang Li and Junbo Zhao and Wei Li},\n      year={2025},\n      eprint={2508.09736},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FByteDance-Seed%2Fm3-agent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FByteDance-Seed%2Fm3-agent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FByteDance-Seed%2Fm3-agent/lists"}