{"id":13543192,"url":"https://github.com/fengyuli-dev/multimedia-gpt","last_synced_at":"2025-04-02T12:31:47.049Z","repository":{"id":146951819,"uuid":"614425687","full_name":"fengyuli-dev/multimedia-gpt","owner":"fengyuli-dev","description":"Empowering your ChatGPT with vision and audio inputs.","archived":false,"fork":false,"pushed_at":"2023-11-19T06:37:19.000Z","size":5528,"stargazers_count":183,"open_issues_count":3,"forks_count":13,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-05-21T08:32:53.936Z","etag":null,"topics":["chatbot","chatgpt","gpt","openai-api"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fengyuli-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-15T15:01:26.000Z","updated_at":"2024-08-01T11:17:34.475Z","dependencies_parsed_at":"2023-11-19T07:36:41.893Z","dependency_job_id":null,"html_url":"https://github.com/fengyuli-dev/multimedia-gpt","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fengyuli-dev%2Fmultimedia-gpt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fengyuli-dev%2Fmultimedia-gpt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fengyuli-dev%2Fmultimedia-gpt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fengyuli-dev%2Fmultimedia-gpt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/owners/fengyuli-dev","download_url":"https://codeload.github.com/fengyuli-dev/multimedia-gpt/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246815690,"owners_count":20838487,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","chatgpt","gpt","openai-api"],"created_at":"2024-08-01T11:00:25.605Z","updated_at":"2025-04-02T12:31:42.042Z","avatar_url":"https://github.com/fengyuli-dev.png","language":"Python","funding_links":[],"categories":["HarmonyOS","精选开源项目合集"],"sub_categories":["Windows Manager","GPT镜像平替"],"readme":"**This repository is not actively maintained as there are recent corporate projects that share our vision, such as [TaskMatrix](https://github.com/microsoft/TaskMatrix), [AutoGPT](https://github.com/Significant-Gravitas/Auto-GPT), and [HuggingGPT](https://github.com/microsoft/JARVIS), that benefit from larger team effort and better management.** \n-------\n\n# Multimedia GPT\n\nMultimedia GPT connects your OpenAI GPT with vision and audio. You can now send images, audio recordings, and PDF documents using your OpenAI API key, and get a response in both text and image formats. We are currently adding support for videos. All of this is made possible by a prompt manager inspired by and built upon [Microsoft Visual ChatGPT](https://github.com/microsoft/visual-chatgpt).\n\n\u003c!-- ## Support Us\nThis project is under active development, and more features will be added soon. 
**Please consider :star: star us** or follow the [author](https://github.com/fengyuli-dev) if this idea is interesting to you. We thank all our [supporters](#supporters)! --\u003e\n\n## Models\nIn addition to all of the vision foundation models mentioned in [Microsoft Visual ChatGPT](https://github.com/microsoft/visual-chatgpt), Multimedia GPT supports [OpenAI Whisper](https://openai.com/research/whisper) and [OpenAI DALLE](https://openai.com/blog/dall-e-api-now-available-in-public-beta)! This means that **you no longer need your own GPUs for voice recognition and image generation** (although you still can!)\n\nThe base chat model can be configured as **any OpenAI LLM**, including ChatGPT and GPT-4. We default to `text-davinci-003`.\n\nYou are welcome to fork this project and add models that are suitable for your own use case. A simple way to do this is through [llama_index](https://github.com/jerryjliu/llama_index). You will have to create a new class for your model in `model.py`, and add a runner method `run_\u003cmodel_name\u003e` in `multimedia_gpt.py`. See `run_pdf` for an example.\n\n## Demo \nIn this demo, ChatGPT is fed a recording of [a person telling the story of Cinderella](public/cinderella.mp3).\n\n![](./public/demo-1.png)\n![](./public/demo-2.jpg)\n\n\n## Installation\n\n```bash\n# Clone this repository\ngit clone https://github.com/fengyuli2002/multimedia-gpt\ncd multimedia-gpt\n\n# Prepare a conda environment\nconda create -n multimedia-gpt python=3.8\nconda activate multimedia-gpt\npip install -r requirements.txt\n\n# Prepare your private OpenAI key (for Linux / macOS)\necho \"export OPENAI_API_KEY='yourkey'\" \u003e\u003e ~/.zshrc\n# Prepare your private OpenAI key (for Windows)\nsetx OPENAI_API_KEY \"\u003cyourkey\u003e\"\n\n# Start Multimedia GPT!\n# You can specify the GPU/CPU assignment with \"--load\"; the parameter indicates which foundation models to use and \n# where each will be loaded. 
The model and device are separated by '_'; different models are separated by ','.\n# The available Visual Foundation Models can be found in models.py\n# For example, if you want to load ImageCaptioning to cuda:0 and Whisper to cpu \n# (Whisper runs remotely, so it doesn't matter where it is loaded)\n# You can use: \"ImageCaptioning_cuda:0,Whisper_cpu\"\n\n# Don't have GPUs? No worries, you can run DALLE and Whisper in the cloud using your API key!\npython multimedia_gpt.py --load ImageCaptioning_cpu,DALLE_cpu,Whisper_cpu\n\n# Additionally, you can configure which OpenAI LLM to use with the \"--llm\" flag, such as \npython multimedia_gpt.py --llm text-davinci-003\n# The default is gpt-3.5-turbo (ChatGPT).\n```\n\n## Plans\nThis project is an experimental work and will not be deployed to a production environment. Our goal is to explore the power of prompting. \n### TODOs\n- [x] Support OpenAI Whisper for speech recognition, added to the default config\n- [x] Support OpenAI DALLE for image generation, added to the default config\n- [x] Support OpenAI DALLE for image editing\n- [x] Add a command-line switch between ChatGPT and GPT-4 backends\n- [x] Implement a function that extracts key frames from a video\n  \n### Known Problems\n- [x] DALLE only accepts square .png images; a workaround is needed\n- [ ] PDFReader (from llama_index) requires a higher version of langchain, which isn't compatible with how Visual ChatGPT is implemented\n\n## Supporters\n[![Stargazers repo roster for @fengyuli-dev/multimedia-gpt](https://reporoster.com/stars/dark/fengyuli-dev/multimedia-gpt)](https://github.com/fengyuli-dev/multimedia-gpt/stargazers)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffengyuli-dev%2Fmultimedia-gpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffengyuli-dev%2Fmultimedia-gpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffengyuli-dev%2Fmultimedia-gpt/lists"}