{"id":18470769,"url":"https://github.com/om-ai-lab/omagent","last_synced_at":"2025-05-14T12:09:38.118Z","repository":{"id":247313724,"uuid":"824259091","full_name":"om-ai-lab/OmAgent","owner":"om-ai-lab","description":"Build multimodal language agents for fast prototype and production","archived":false,"fork":false,"pushed_at":"2025-03-19T11:36:13.000Z","size":11956,"stargazers_count":2465,"open_issues_count":17,"forks_count":271,"subscribers_count":133,"default_branch":"main","last_synced_at":"2025-04-12T22:16:56.378Z","etag":null,"topics":["agent","chatbot","gemini","gpt","gpt4","gradio","language-agent","large-language-models","llama","llava","llm","multimodal","multimodal-agent","openai","python","rag","smart-hardware","vision-and-language","vlm","workflow"],"latest_commit_sha":null,"homepage":"https://om-agent.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/om-ai-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-04T17:48:43.000Z","updated_at":"2025-04-10T03:43:10.000Z","dependencies_parsed_at":null,"dependency_job_id":"068d1c35-0a73-40c9-8595-85cbe294264e","html_url":"https://github.com/om-ai-lab/OmAgent","commit_stats":null,"previous_names":["om-ai-lab/omagent"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/om-ai-lab%2FOmAgent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/om-ai-lab%2FOmAgent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
om-ai-lab%2FOmAgent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/om-ai-lab%2FOmAgent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/om-ai-lab","download_url":"https://codeload.github.com/om-ai-lab/OmAgent/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248637786,"owners_count":21137538,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","chatbot","gemini","gpt","gpt4","gradio","language-agent","large-language-models","llama","llava","llm","multimodal","multimodal-agent","openai","python","rag","smart-hardware","vision-and-language","vlm","workflow"],"created_at":"2024-11-06T10:14:50.185Z","updated_at":"2025-04-12T22:17:10.182Z","avatar_url":"https://github.com/om-ai-lab.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/OmAgent-banner.png\" width=\"400\"/\u003e\n\u003c/p\u003e\n\n\u003cdiv\u003e\n    \u003ch1 align=\"center\"\u003e🌟 Build Multimodal Language Agents with Ease 🌟\u003c/h1\u003e\n\u003c/div\u003e\n\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://twitter.com/intent/follow?screen_name=OmAI_lab\" target=\"_blank\"\u003e\n    \u003cimg alt=\"X (formerly Twitter) Follow\" src=\"https://img.shields.io/twitter/follow/OmAI_lab\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://discord.gg/G9n5tq4qfK\" target=\"_blank\"\u003e\n    \u003cimg alt=\"Discord\" src=\"https://img.shields.io/discord/1296666215548321822?style=flat\u0026logo=discord\"\u003e\n  
\u003c/a\u003e\n\u003c/p\u003e\n\n\n## 📖 Introduction  \nOmAgent is a Python library for building multimodal language agents with ease. We keep the library **simple**, without the overhead found in other agent frameworks.   \n - We wrap the complex engineering (worker orchestration, task queues, node optimization, etc.) behind the scenes and leave you with a super-easy-to-use interface for defining your agent.   \n - We provide useful abstractions for reusable agent components, so you can build complex agents by aggregating these basic components.   \n - We also provide the features multimodal agents require, such as native support for VLMs, video processing, and mobile device connection, making it easy for developers and researchers to build agents that reason over not only text but also image, video, and audio inputs. \n\n## 🔑 Key Features  \n - A flexible agent architecture that provides a graph-based workflow orchestration engine and various memory types enabling contextual reasoning.  \n - Native multimodal interaction support, including VLMs, real-time APIs, computer vision models, and mobile connection.   \n - A suite of state-of-the-art unimodal and multimodal agent algorithms that go beyond simple LLM reasoning, e.g. ReAct, CoT, and SC-CoT.   \n - Supports local deployment of models: deploy your own models locally using [Ollama](./docs/concepts/models/Ollama.md) or [LocalAI](./examples/video_understanding/docs/local-ai.md).\n - Fully distributed architecture with support for custom scaling. 
Also supports a Lite mode, eliminating the need for middleware deployment.\n\n\n## 🛠️ How To Install\n- python \u003e= 3.10\n- Install omagent_core  \n  Use pip to install the latest omagent_core release.\n  ```bash\n  pip install omagent-core\n  ```\n  Or install the latest version from source:\n  ```bash\n  pip install -e omagent-core\n  ```\n\n## 🚀 Quick Start \n### Configuration\n\nThe container.yaml file manages dependencies and settings for the different components of the system. To set up your configuration:\n\n1. Generate the container.yaml file:\n   ```bash\n   cd examples/step1_simpleVQA\n   python compile_container.py\n   ```\n   This creates a container.yaml file with default settings under `examples/step1_simpleVQA`. For more information about the container.yaml configuration, please refer to the [container module](./docs/concepts/container.md).\n\n2. Configure your LLM settings in `configs/llms/gpt.yml`:\n\n   - Set your OpenAI API key or a compatible endpoint through environment variables or by directly modifying the yml file:\n   ```bash\n   export custom_openai_key=\"your_openai_api_key\"\n   export custom_openai_endpoint=\"your_openai_endpoint\"\n   ```\n   You can use a locally deployed Ollama to call your own language model; the tutorial is [here](docs/concepts/models/Ollama.md).\n\n### Run the demo\n\n1. Run the simple VQA demo with the webpage GUI:\n\n   With WebpageClient, input and output happen in the webpage.\n   ```bash\n   cd examples/step1_simpleVQA\n   python run_webpage.py\n   ```\n   Open `http://127.0.0.1:7860` in your browser; you will see the following interface:  \n   \u003cimg src=\"docs/images/simpleVQA_webpage.png\" width=\"400\"/\u003e\n\n## 🤖 Example Projects\n### 1. Video QA Agents\nBuild a system that answers questions about uploaded videos using video understanding agents. We provide a Gradio-based application; see details [here](examples/video_understanding/README.md).  
\n\u003cp\u003e\n  \u003cimg src=\"docs/images/video_understanding_gradio.png\" width=\"500\"/\u003e\n\u003c/p\u003e\n\nMore about the video understanding agent can be found in the [paper](https://arxiv.org/abs/2406.16620).\n\u003cp\u003e\n  \u003cimg src=\"docs/images/OmAgent.png\" width=\"500\"/\u003e\n\u003c/p\u003e\n\n\n### 2. Mobile Personal Assistant\nBuild your personal multimodal assistant, just like Google Astra, in 2 minutes. See details [here](docs/tutorials/agent_with_app.md).\n\u003cp\u003e\n  \u003cimg src=\"docs/images/readme_app.png\" width=\"200\"/\u003e\n\u003c/p\u003e\n\n\n### 3. Agentic Operators\nWe define reusable agentic workflows, e.g. CoT and ReAct, as **agent operators**. This project compares recently proposed reasoning agent operators with the same LLM choice and test datasets. How do they perform? See details [here](docs/concepts/agent_operators.md).\n\n| **Algorithm** | **LLM** | **Average** | **gsm8k-score** | **gsm8k-cost($)** | **AQuA-score** | **AQuA-cost($)** |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| SC-CoT | gpt-3.5-turbo | 73.69 | 80.06 | 5.0227 | 67.32 | 0.6491 |\n| CoT | gpt-3.5-turbo | 69.86 | 78.70 | 0.6788 | 61.02 | 0.0957 |\n| ReAct-Pro | gpt-3.5-turbo | 69.74 | 74.91 | 3.4633 | 64.57 | 0.4928 |\n| PoT | gpt-3.5-turbo | 64.42 | 76.88 | 0.6902 | 51.97 | 0.1557 |\n| IO* | gpt-3.5-turbo | 38.40 | 37.83 | 0.3328 | 
38.98 | 0.0380 |\n\n*IO: Input-Output Direct Prompting (Baseline)  \n\nMore details are available in our new repo [open-agent-leaderboard](https://github.com/om-ai-lab/open-agent-leaderboard) and the [Hugging Face space](https://huggingface.co/spaces/omlab/open-agent-leaderboard).\n\n\n## 💻 Documentation\nMore detailed documentation is available [here](https://om-ai-lab.github.io/OmAgentDocs/).\n\n## 🤝 Contributing\nFor more information on how to contribute, see [here](CONTRIBUTING.md).  \nWe value and appreciate the contributions of our community. Special thanks to our contributors for helping us improve OmAgent.\n\n\u003ca href=\"https://github.com/om-ai-lab/OmAgent/graphs/contributors\"\u003e\n  \u003cimg src=\"https://contrib.rocks/image?repo=om-ai-lab/OmAgent\" /\u003e\n\u003c/a\u003e\n\n## 🔔 Follow us\nYou can follow us on [X](https://x.com/OmAI_lab), [Discord](https://discord.gg/G9n5tq4qfK), and our WeChat group for more updates and discussions.  \n\u003cp\u003e\n  \u003cimg src=\"docs/images/readme_qr_code.png\" width=\"200\"/\u003e\n\u003c/p\u003e\n\n\n## 🔗 Related works\nIf you are intrigued by multimodal large language models and agent technologies, we invite you to delve deeper into our research endeavors:  \n🔆 [How to Evaluate the Generalization of Detection? 
A Benchmark for Comprehensive Open-Vocabulary Detection](https://arxiv.org/abs/2308.13177) (AAAI24)   \n🏠 [GitHub Repository](https://github.com/om-ai-lab/OVDEval/tree/main)\n\n🔆 [OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network](https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cvi2.12268) (IET Computer Vision)  \n🏠 [GitHub Repository](https://github.com/om-ai-lab/OmDet)\n\n## ⭐️ Citation\n\nIf you find our repository beneficial, please cite our paper:  \n```bibtex\n@article{zhang2024omagent,\n  title={OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer},\n  author={Zhang, Lu and Zhao, Tiancheng and Ying, Heting and Ma, Yibo and Lee, Kyusong},\n  journal={arXiv preprint arXiv:2406.16620},\n  year={2024}\n}\n```\n","funding_links":[],"categories":["Chatbots","Agent Categories"],"sub_categories":["\u003ca name=\"Unclassified\"\u003e\u003c/a\u003eUnclassified"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fom-ai-lab%2Fomagent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fom-ai-lab%2Fomagent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fom-ai-lab%2Fomagent/lists"}