{"id":17938848,"url":"https://github.com/bilel-bj/ROSGPT_Vision","last_synced_at":"2025-03-24T10:31:45.557Z","repository":{"id":189927827,"uuid":"680733537","full_name":"bilel-bj/ROSGPT_Vision","owner":"bilel-bj","description":"Commanding robots using only Language Models' prompts","archived":false,"fork":false,"pushed_at":"2025-02-16T06:37:02.000Z","size":22263,"stargazers_count":96,"open_issues_count":1,"forks_count":13,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-16T07:23:03.621Z","etag":null,"topics":["chatgpt","language-models","language-models-are-next","large-language-models","llm","prompt-engineering","prompting-robotic-modalities","robotic-design-patterns","robotic-vision","robotics","ros2","visual-language-models"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bilel-bj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-20T08:10:06.000Z","updated_at":"2025-02-16T06:37:05.000Z","dependencies_parsed_at":"2024-08-07T09:44:02.786Z","dependency_job_id":"1e248406-406a-4229-ac3e-19b41673e8dc","html_url":"https://github.com/bilel-bj/ROSGPT_Vision","commit_stats":null,"previous_names":["bilel-bj/rosgpt_vision"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bilel-bj%2FROSGPT_Vision","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bilel-bj%2FROSGPT_Vision/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bil
el-bj%2FROSGPT_Vision/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bilel-bj%2FROSGPT_Vision/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bilel-bj","download_url":"https://codeload.github.com/bilel-bj/ROSGPT_Vision/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245252302,"owners_count":20585007,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatgpt","language-models","language-models-are-next","large-language-models","llm","prompt-engineering","prompting-robotic-modalities","robotic-design-patterns","robotic-vision","robotics","ros2","visual-language-models"],"created_at":"2024-10-29T00:06:23.880Z","updated_at":"2025-03-24T10:31:45.550Z","avatar_url":"https://github.com/bilel-bj.png","language":"Python","funding_links":[],"categories":["Research-Grade Frameworks","Paper List"],"sub_categories":["Follow-up Papers"],"readme":"# ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts\n\n[Bilel Benjdira](https://github.com/bilel-bj), [Anis Koubaa](https://github.com/aniskoubaa) and [Anas M. 
Ali](https://github.com/AnasHXH)\n[![arXiv](https://img.shields.io/badge/arXiv-Paper-\u003cCOLOR\u003e.svg)](https://arxiv.org/abs/2308.11236)\n[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/nYnpzSCaMyw) \u003ca href=\"https://www.sciencedirect.com/science/article/abs/pii/S0167739X25000184\"\u003e\n  \u003cimg src=\"https://sdfestaticassets-eu-west-1.sciencedirectassets.com/prod/44c3817e58b49348a73e63fb998fb7b2924522e1/image/elsevier-non-solus.png\" alt=\"arXiv\" width=\"100\" height=\"50\"\u003e\n\u003c/a\u003e\n\n\u003cimg src=\"https://github.com/bilel-bj/ROSGPT_Vision/blob/main/paper.png\" width=\"900\" height=\"600\"/\u003e\n**Robotics and Internet of Things Lab (RIOTU Lab), Prince Sultan University, Saudi Arabia**\n\nInspired by [ROSGPT](https://github.com/aniskoubaa/rosgpt). Both projects aim to bridge the gap between robotics, natural language understanding, and image analysis. \n\nCollaborators who want to participate in this project are very welcome. \n\n------------------------------------------------------------------------------------------------------------------------------------------\n- **ROSGPT_Vision** is a new robotic framework designed to command robots using only two prompts:\n\t- a **Visual Prompt** (for visual semantic features), and\n \t- an **LLM Prompt** (to regulate robotic reactions).\n- It is based on a new robotic design pattern: **Prompting Robotic Modalities (PRM)**.\n- **ROSGPT_Vision** is used to develop **CarMate**, a robotic application for monitoring driver distractions and providing real-time vocal notifications. 
It showcases cost-effective development.\n- We demonstrated how to optimize the prompting strategies to improve the application.\n- The LangChain framework is used to easily customize prompts.\n- More details are described in the academic paper \"ROSGPT_Vision: Commanding Robots using only Language Models' Prompts\".\n\n\n# Video Demo\nAn illustrative video demonstration of ROSGPT_Vision is provided:\n[![ROSGPT Video Demonstration](https://github.com/bilel-bj/ROSGPT_Vision/blob/main/video_thumbnail.png)](https://youtu.be/nYnpzSCaMyw)\n\n## Table of Contents\n\n- [Overview](#overview)\n- [ROSGPT_Vision diagram](#rosgpt_vision-diagram)\n- [Prompting Robotic Modalities (PRM) Design Pattern](#prompting-robotic-modalities-prm-design-pattern)\n- [CarMate Application](#carmate-application)\n- [Installation](#installation)\n- [Usage](#usage)\n- [Citation](#citation)\n- [License](#license)\n- [Acknowledgement](#acknowledgement)\n- [Contribute](#contribute)\n\n## Overview\n\n**ROSGPT_Vision** offers a unified platform that allows robots to perceive, interpret, and interact with visual data through natural language. The framework leverages state-of-the-art language models, including [LLAVA](https://github.com/haotian-liu/LLaVA), [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), and [Caption-Anything](https://github.com/facebookresearch/segment-anything), to facilitate advanced reasoning about image data. [LangChain](https://github.com/langchain-ai/langchain) is used for easy customization of the prompts. 
The provided implementation includes the **CarMate** application, a driver monitoring and assistance system designed to ensure safe and efficient driving experiences.\n\n## ROSGPT_Vision diagram\n\u003cimg src=\"https://github.com/bilel-bj/ROSGPT_Vision/blob/main/ROSGPT_Vision.png\" width=\"900\" height=\"600\"/\u003e\n\n## Prompting Robotic Modalities (PRM) Design Pattern\n- A new design approach emphasizing modular and individualized sensory queries.\n- Uses specific **Modality Language Models (MLM)** for textual interpretations of inputs, like the **Vision Language Model (VLM)** for visual data.\n- Ensures precise data collection by treating each sensory input separately.\n- **Task Modality**'s Role: Serves as the central coordinator, synthesizing data from various modalities.\n\nFor more information, see the paper: [![arXiv](https://img.shields.io/badge/arXiv-Paper-\u003cCOLOR\u003e.svg)](https://arxiv.org/abs/2308.11236)\n  \n\u003cimg src=\"https://github.com/bilel-bj/ROSGPT_Vision/blob/main/IRM_Diagram%20(1).png\" width=\"800\" height=\"500\"/\u003e\n\n## CarMate Application\n**CarMate** is a complete application for monitoring driver behavior, developed just by setting two prompts in the YAML file. It automatically analyses the input video using the Visual prompt, decides what should be done using the LLM prompt, and gives an instant alert to the driver when needed. \n\nThese are the prompts used to develop the application, without needing extra code: \n\n**The Visual prompt:**\n\n\tVisual prompt: \"Describe the driver’s current level of focus \n \ton driving based on the visual cues, Answer with one short sentence.\"\n\n**The LLM prompt:**\n\n\tLLM prompt:\"Consider the following ontology: You must write your Reply \n \twith one short sentence. Behave as a carmate that surveys the driver \n  \tand gives him advice and instruction to drive safely. You will be given \n   \thuman language prompts describing an image. 
Your task is to provide \n    \tappropriate instructions to the driver based on the description.\"\n\nThree example scenarios captured while driving are shown here: \n\n### Scenario 1: The driver is using a phone\nThe top box shows the description generated by the image semantics module for the input image using the Visual prompt. \nThe second box shows the alert generated for the driver using the LLM prompt. \n\n\u003cimg src=\"https://github.com/bilel-bj/ROSGPT_Vision/blob/main/demo-distraction-phone.png\" width=\"900\" height=\"600\"/\u003e\n\n### Scenario 2: The driver is taking pictures \n\u003cimg src=\"https://github.com/bilel-bj/ROSGPT_Vision/blob/main/demo-distraction-taking-pictures.png\" width=\"900\" height=\"600\"/\u003e\n\n### Scenario 3: The driver is drinking\n\u003cimg src=\"https://github.com/bilel-bj/ROSGPT_Vision/blob/main/demo-distraction-drinking.png\" width=\"900\" height=\"600\"/\u003e\n\n\n## Installation\n#### To use ROSGPT_Vision, follow these steps:\n**1. Prepare the code and the environment**\n\n  Clone our repository, then create a Python environment and activate it with the following commands:\n\n```bash\n  git clone https://github.com/bilel-bj/ROSGPT_Vision.git\n  cd ROSGPT_Vision\n  git clone https://github.com/Vision-CAIR/MiniGPT-4.git\n  git clone https://github.com/haotian-liu/LLaVA.git\n  conda env create -f environment.yml\n  conda activate ROSGPT_Vision\n```\n\n\n\n**2. Install the required dependencies**\n\n- You can run image_semantics.py after installing all required dependencies from [LLAVA](https://github.com/haotian-liu/LLaVA), [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4) and [Caption-Anything](https://github.com/facebookresearch/segment-anything).\n\n- Ensure that all dependencies required for ROS2 are installed.\n\n\n\n## Usage\n1. 
**To adjust the parameters of ROSGPT_Vision, edit the corresponding .yaml file.**\n\n---\n\n\u003e **The YAML contains 6 main sections of configuration parameters:**\n\n\n- **Task_name**: This field specifies the name of the task that the ROS system is configured to perform. \n\n- **ROSGPT_Vision_Camera_Node**: This section contains the configuration for the ROSGPT\\_Vision\\_Camera\\_Node. \n\n- **Image_Description_Method**: This field specifies the method used by the node to generate descriptions from images. It can be one of the currently developed methods: MiniGPT4, LLaVA, or SAM. The configuration needed for each of them is given separately at the end of this file. \n\n- **Vision_prompt**: This field specifies the prompt used to guide the image description process.\n\n- **Output_video**: This field specifies the path or file name under which the output video is saved.\n\n- **GPT_Consultation_Node**: This section contains the configuration for the GPT\\_Consultation\\_Node.\n\n\t- **llm_prompt**: This field specifies the prompt used to guide the language model.\n  \n\t- **GPT_temperature**: This field specifies the temperature parameter for the GPT model, which controls the randomness of the model's output.\n\n- **MiniGPT4_parameters**: This section contains the configuration for the MiniGPT4 model. It must be set if the model is used in this task; otherwise it can be left empty. 
\n\n\t- **configuration**: This field specifies the path for the configuration file of MiniGPT4.\n\n\t- **temperature_miniGPT4**: This field specifies the temperature parameter for the MiniGPT4 model.\n\n- **llava_parameters**: This section contains the configuration for the LLaVA model (if used).\n\n\t- **temperature_llavA**: This field specifies the temperature parameter for the LLaVA model.\n\n- **SAM_parameters**: This section contains the configuration for the SAM model.\n\n\t- **weights_SAM**: This field specifies the weights used by the SAM model.\n\n\n2. **Run in terminals on the local machine**\n\n- First terminal:\n\n```bash\ncolcon build --packages-select rosgpt_vision\nsource install/setup.bash\npython3 src/rosgpt_vision/rosgpt_vision/rosgpt_vision_node_web_cam.py\npython3 src/rosgpt_vision/rosgpt_vision/ROSGPT_Vision_Camera_Node.py /home/anas/ros2_ws/src/rosgpt_vision/rosgpt_vision/cfg/driver_phone_usage.yaml\n```\n- Second terminal:\n\n```bash\ncolcon build --packages-select rosgpt_vision\nsource install/setup.bash\npython3 src/rosgpt_vision/rosgpt_vision/ROSGPT_Vision_GPT_Consultation_Node.py /home/anas/ros2_ws/src/rosgpt_vision/rosgpt_vision/cfg/driver_phone_usage.yaml\n```\n- Third terminal:\n\n```bash\nros2 topic echo /Image_Description\n```\n\n- Fourth terminal:\n\n```bash\nros2 topic echo /GPT_Consultation\n```\n\n## Citation\n\n[![arXiv](https://img.shields.io/badge/arXiv-Paper-\u003cCOLOR\u003e.svg)](https://arxiv.org/abs/2308.11236)  \u003ca href=\"https://www.sciencedirect.com/science/article/abs/pii/S0167739X25000184\"\u003e\n  \u003cimg src=\"https://sdfestaticassets-eu-west-1.sciencedirectassets.com/prod/44c3817e58b49348a73e63fb998fb7b2924522e1/image/elsevier-non-solus.png\" alt=\"arXiv\" width=\"100\" height=\"50\"\u003e\n\u003c/a\u003e\n\n\n\t@article{BENJDIRA2025107723,\n\ttitle = {Prompting Robotic Modalities (PRM): A structured architecture for centralizing 
language models in complex systems},\n\tjournal = {Future Generation Computer Systems},\n\tvolume = {166},\n\tpages = {107723},\n\tyear = {2025},\n\tissn = {0167-739X},\n\tdoi = {https://doi.org/10.1016/j.future.2025.107723},\n\turl = {https://www.sciencedirect.com/science/article/pii/S0167739X25000184},\n\tauthor = {Bilel Benjdira and Anis Koubaa and Anas M. Ali},\n\tkeywords = {Expert systems architectures, Robotics, Languages models in robotics, Prompting robotic modalities, Large language models, LLMs, Vision language models, VLMs, Robotic operating system, ROS, ROS2, Robotic prompt engineering, Visual prompt, LLM prompt},\n\t}\n    \n## License\n\nThis project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. You are free to use, share, and adapt this material for non-commercial purposes, as long as you provide attribution to the original author(s) and the source.\n\n## Acknowledgement\n\nThe code is based on [ROSGPT](https://github.com/aniskoubaa/rosgpt), [LLAVA](https://github.com/haotian-liu/LLaVA), [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [Caption-Anything](https://github.com/ttengwang/Caption-Anything) and [SAM](https://github.com/facebookresearch/segment-anything). Please also follow their licenses. Thanks to their authors for their excellent work.\n\n## Contribute\n\nAs this project is still in progress, contributions are welcome! To contribute, please follow these steps:\n\n1. Fork the repository on GitHub.\n2. Create a new branch for your feature or bugfix.\n3. Commit your changes and push them to your fork.\n4. 
Create a pull request to the main repository.\n\nBefore submitting your pull request, please ensure that your changes do not break the build and adhere to the project's coding style.\n\nFor any questions or suggestions, please open an issue on the [GitHub issue tracker](https://github.com/bilel-bj/ROSGPT_Vision/issues).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbilel-bj%2FROSGPT_Vision","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbilel-bj%2FROSGPT_Vision","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbilel-bj%2FROSGPT_Vision/lists"}