{"id":25128756,"url":"https://github.com/Alibaba-NLP/OmniSearch","last_synced_at":"2025-10-23T08:31:15.572Z","repository":{"id":262594398,"uuid":"877657029","full_name":"Alibaba-NLP/OmniSearch","owner":"Alibaba-NLP","description":"Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent","archived":false,"fork":false,"pushed_at":"2025-01-13T00:02:39.000Z","size":13678,"stargazers_count":201,"open_issues_count":0,"forks_count":12,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-01-13T01:18:56.187Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Alibaba-NLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-24T02:39:51.000Z","updated_at":"2025-01-13T00:02:42.000Z","dependencies_parsed_at":"2025-01-13T01:18:52.812Z","dependency_job_id":"fbd1fb19-4cfc-4294-9009-ce3e1fa4da0b","html_url":"https://github.com/Alibaba-NLP/OmniSearch","commit_stats":null,"previous_names":["alibaba-nlp/omnisearch"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alibaba-NLP%2FOmniSearch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alibaba-NLP%2FOmniSearch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alibaba-NLP%2FOmniSearch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alibaba-NLP%2FOmniSearch/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Alibaba-NLP","download_url":"https://codeload.github.com/Alibaba-NLP/OmniSearch/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237801562,"owners_count":19368576,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-08T12:01:31.275Z","updated_at":"2025-10-23T08:31:15.566Z","avatar_url":"https://github.com/Alibaba-NLP.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Python"],"sub_categories":["大语言对话模型及数据"],"readme":"\u003cdiv align=\"center\"\u003e\r\n\u003cimg src=\"IMG/logo.png\" width=\"55%\"\u003e\r\n\u003c/div\u003e\r\n\r\n# A Self-Adaptive Planning Agent For Multimodal RAG\r\n\r\n![](https://img.shields.io/badge/version-1.0.0-blue)[![Pytorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?e\u0026logo=PyTorch\u0026logoColor=white)](https://pytorch.org/)[![arxiv badge](https://img.shields.io/badge/arxiv-2411.02937-red)](https://arxiv.org/abs/2411.02937)\r\n\r\nRepo for [*Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent*](https://arxiv.org/abs/2411.02937)\r\n\r\nYou can visit the OmniSearch homepage by clicking [*here!*](https://alibaba-nlp.github.io/OmniSearch/)\r\n\r\n🌏 The **Chinese Web Demo** is now available at [ModelScope](https://modelscope.cn/studios/iic/OmniSearch/summary?header=default\u0026fullWidth=false)!\r\n\r\n\u003cimg src=\"IMG/ask_test_2.5.gif\" width=\"799\" height=530\u003e\r\n\r\n- We propose OmniSearch, a 
self-adaptive retrieval agent that plans each retrieval action in real time according to the question-solving stage and the currently retrieved content. As far as we know, **OmniSearch is the first planning agent for multimodal RAG.**\r\n- We reveal that existing VQA-based mRAG benchmarks fail to reflect that real-world questions require dynamic knowledge retrieval, and propose the novel **Dyn-VQA dataset, which contains three types of dynamic questions.**\r\n- We **benchmark various mRAG methods** with leading MLLMs on Dyn-VQA, demonstrating their shortcomings in providing sufficient and relevant knowledge for dynamic questions.\r\n\r\n\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n    \u003cimg src=\"IMG/method4.jpg\" width=\"80%\" height=\"auto\" /\u003e\r\n\u003c/div\u003e\r\n\r\n\r\n\r\n## 💡 Performance\r\n\r\nThe performance of various MLLMs with different mRAG strategies is shown below:\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n    \u003cimg src=\"IMG/main_result.jpg\" width=\"80%\" height=\"auto\" /\u003e\r\n\u003c/div\u003e\r\n\r\nFurther analysis experiments can be found in the paper.\r\n\r\n# 📚 Dyn-VQA Dataset\r\n\r\nEach JSON item in the Dyn-VQA dataset is organized in the following format:\r\n```json\r\n{\r\n    \"image_url\": \"https://www.pcarmarket.com/static/media/uploads/galleries/photos/uploads/galleries/22387-pasewark-1986-porsche-944/.thumbnails/IMG_7102.JPG.jpg/IMG_7102.JPG-tiny-2048x0-0.5x0.jpg\",\r\n    \"question\": \"What is the model of car from this brand?\",\r\n    \"question_id\": \"qid\",\r\n    \"answer\": [\"保时捷 944\", \"Porsche 944.\"]\r\n}\r\n```\r\n\r\n🔥 The Dyn-VQA dataset **will be updated regularly.** Latest version: 202412.\r\n\r\n# 🛠 Dependencies\r\n\r\n```bash\r\npip install -r requirement.txt\r\n```\r\n\r\n#### Details\r\n\r\n- Python = 3.11.9\r\n- [PyTorch](http://pytorch.org/) (\u003e= 2.0.0)\r\n- pillow = 10.4.0\r\n- requests = 2.32.3\r\n- google-search-results = 2.4.2\r\n- serpapi = 0.1.5\r\n\r\n# 💻 Running OmniSearch\r\n\r\n- GPT-4V-based 
OmniSearch\r\n\r\nWe have released the code of GPT-4V-based OmniSearch for English questions.\r\n\r\nBefore running, please replace the placeholders with your own OpenAI key and Google search key. The OpenAI key is on line 11 of `main.py`:\r\n\r\n```python\r\nGPT_API_KEY = \"your_actual_key_here\"\r\nheaders = {\r\n    \"Authorization\": f\"Bearer {GPT_API_KEY}\"\r\n}\r\n```\r\n\r\nThe Google search key is on line 10 of `search_api.py`:\r\n\r\n```python\r\nAPI_KEY = \"your api-key\"\r\n```\r\n\r\nThe result is saved to the path:\r\n\r\n```python\r\noutput_path = os.path.join(meta_save_path, dataset_name, \"output_from_gpt4v.jsonl\")\r\n```\r\n\r\nRun the `main.py` file:\r\n\r\n```bash\r\npython main.py --test_dataset 'path/to/dataset.jsonl' --dataset_name NAME --meta_save_path 'path/to/results'\r\n```\r\n\r\n- Qwen-VL-based OmniSearch\r\n\r\nWe have made the [training data](https://github.com/Alibaba-NLP/OmniSearch/tree/main/dataset/training_data) for Qwen-VL-based OmniSearch publicly available. This data, along with the [CogVLM dataset](https://modelscope.cn/datasets/ZhipuAI/CogVLM-SFT-311K), was used to jointly train [Qwen-VL-Chat](https://www.modelscope.cn/models/Qwen/Qwen-VL-Chat) with the [SWIFT framework](https://github.com/modelscope/ms-swift). The training script can be executed as follows:\r\n\r\n```\r\nswift sft --model_type qwen-vl-chat --dataset /Data/Path/to/Training_data_1 /Data/Path/to/Training_data_2 --model_id_or_path /Model/Path/to/Qwen-VL-Chat/ --output_dir /Output/Model/Path --max_length 8192 --evaluation_strategy 'no'\r\n```\r\n\r\nYou can download the model from [OmniSearch-Qwen-VL-Chat-en on Hugging Face](https://huggingface.co/Alibaba-NLP/OmniSearch-Qwen-VL-Chat-en/tree/main).\r\n\r\nRun the test script.  
Run the `Omnisearch_qwen.py` file:\r\n\r\n```bash\r\npython Omnisearch_qwen.py --test_dataset '/path/to/dataset.jsonl' --dataset_name NAME --meta_save_path '/path/to/results' --model_path '/local/path/to/OmniSearch-Qwen-Chat-VL-weight'\r\n```\r\n\r\n\r\n\r\n# 🔍 Evaluation\r\n\r\nThe evaluation script for token F1-Recall of the output answers can be used as follows:\r\n\r\n```bash\r\npython evaluate.py --evaluate_file_path [path to output jsonl file] --lang [language of the QA dataset: en/zh]\r\n```\r\n\r\n## 🔥 TODO\r\n\r\n- Release code for Qwen-VL-Chat based OmniSearch\r\n- Release the corresponding model weight\r\n- Create a benchmark for Dyn-VQA\r\n\r\n## 📄 Acknowledgements\r\n\r\n- The repo is contributed by Xinyu Wang, Shuo Guo, Zhen Zhang and Yangning Li.\r\n- This work was inspired by ReAct, Self-Ask, and FreshLLMs. Sincere thanks for their efforts.\r\n\r\n## 📝 Citation\r\n\r\n```bibtex\r\n@article{li2024benchmarkingmultimodalretrievalaugmented,\r\n      title={Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent}, \r\n      author={Yangning Li and Yinghui Li and Xinyu Wang and Yong Jiang and Zhen Zhang and Xinran Zheng and Hui Wang and Hai-Tao Zheng and Pengjun Xie and Philip S. Yu and Fei Huang and Jingren Zhou},\r\n      year={2024},\r\n      eprint={2411.02937},\r\n      archivePrefix={arXiv},\r\n      primaryClass={cs.CL},\r\n      url={https://arxiv.org/abs/2411.02937}, \r\n}\r\n```\r\n\r\n\r\nWhen citing our work, please kindly consider citing the original papers. The relevant citation information is listed here.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAlibaba-NLP%2FOmniSearch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAlibaba-NLP%2FOmniSearch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAlibaba-NLP%2FOmniSearch/lists"}