{"id":13958430,"url":"https://github.com/OpenMOSS/AnyGPT","last_synced_at":"2025-07-21T00:30:49.288Z","repository":{"id":223430174,"uuid":"759306668","full_name":"OpenMOSS/AnyGPT","owner":"OpenMOSS","description":"Code for \"AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling\"","archived":false,"fork":false,"pushed_at":"2024-08-27T13:34:35.000Z","size":10372,"stargazers_count":783,"open_issues_count":17,"forks_count":63,"subscribers_count":21,"default_branch":"main","last_synced_at":"2024-11-23T19:50:02.803Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenMOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-18T08:09:11.000Z","updated_at":"2024-11-22T12:19:18.000Z","dependencies_parsed_at":"2024-02-20T07:44:21.038Z","dependency_job_id":"b3c6e216-3466-45db-b05b-7171d7be6519","html_url":"https://github.com/OpenMOSS/AnyGPT","commit_stats":null,"previous_names":["openmoss/anygpt"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FAnyGPT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FAnyGPT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FAnyGPT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FAnyGPT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenMOSS","download_url":"https://codeloa
d.github.com/OpenMOSS/AnyGPT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226849981,"owners_count":17691892,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-08T13:01:34.201Z","updated_at":"2024-11-28T02:30:42.127Z","avatar_url":"https://github.com/OpenMOSS.png","language":"Python","readme":"# Official Repository for paper \"AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling\"\n\n\u003ca href=\"https://junzhan2000.github.io/AnyGPT.github.io/\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Project-Page-Green\" alt=\"Project Page Badge\"\u003e\n\u003c/a\u003e\n\u003ca href=\"https://arxiv.org/pdf/2402.12226.pdf\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Paper-Arxiv-red\" alt=\"Paper Arxiv Badge\"\u003e\n\u003c/a\u003e\n\u003ca href=\"https://arxiv.org/pdf/2402.12226.pdf\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Datasets-AnyInstruct-yellow\" alt=\"Datasets\"\u003e\n\u003c/a\u003e\n\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"static/images/logo.png\" width=\"16%\"\u003e \u003cbr\u003e\n\u003c/p\u003e\n\n## Introduction\n\nWe introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. The [base model](https://huggingface.co/fnlp/AnyGPT-base) aligns the four modalities, allowing for intermodal conversions between different modalities and text. 
Furthermore, we constructed the [AnyInstruct](https://huggingface.co/datasets/fnlp/AnyInstruct) dataset based on various generative models, which contains instructions for arbitrary modal interconversion. Trained on this dataset, our [chat model](https://huggingface.co/fnlp/AnyGPT-chat) can engage in free multimodal conversations, where multimodal data can be inserted at will.\n\nAnyGPT proposes a generative training scheme that converts all modal data into a unified discrete representation, using the Next Token Prediction task for unified training on a Large Language Model (LLM). From the perspective of 'compression is intelligence': when the quality of the tokenizer is high enough, and the perplexity (PPL) of the LLM is low enough, it is possible to compress the vast amount of multimodal data on the internet into the same model, thereby giving rise to emergent capabilities not present in a pure text-based LLM.\nDemos are shown on the [project page](https://junzhan2000.github.io/AnyGPT.github.io).\n\n## Example Demonstrations\n\n[![Demo Video](http://img.youtube.com/vi/oW3E3pIsaRg/0.jpg)](https://www.youtube.com/watch?v=oW3E3pIsaRg)\n\n## Open-Source Checklist\n\n- [X] Base Model\n- [X] Chat Model\n- [X] Inference Code\n- [X] Instruction Dataset\n\n## Inference\n\n### Installation\n\n```bash\ngit clone https://github.com/OpenMOSS/AnyGPT.git\ncd AnyGPT\nconda create --name AnyGPT python=3.9\nconda activate AnyGPT\npip install -r requirements.txt\n```\n\n### Model Weights\n\n* Check the AnyGPT-base weights in [fnlp/AnyGPT-base](https://huggingface.co/fnlp/AnyGPT-base)\n* Check the AnyGPT-chat weights in [fnlp/AnyGPT-chat](https://huggingface.co/fnlp/AnyGPT-chat)\n* Check the SpeechTokenizer and Soundstorm weights in [fnlp/AnyGPT-speech-modules](https://huggingface.co/fnlp/AnyGPT-speech-modules)\n* Check the SEED tokenizer weights in [AILab-CVC/seed-tokenizer-2](https://huggingface.co/AILab-CVC/seed-tokenizer-2)\n\nThe SpeechTokenizer is used for tokenizing and reconstructing speech, 
Soundstorm is responsible for completing paralinguistic information, and SEED-tokenizer is used for tokenizing images.\n\nThe model weights of unCLIP SD-UNet, which is used to reconstruct images, and Encodec-32k, which is used to tokenize and reconstruct music, will be downloaded automatically.\n\n### Base model CLI Inference\n\n```bash\npython anygpt/src/infer/cli_infer_base_model.py \\\n--model-name-or-path \"path/to/AnyGPT-7B-base\" \\\n--image-tokenizer-path 'path/to/model' \\\n--speech-tokenizer-path \"path/to/model\" \\\n--speech-tokenizer-config \"path/to/config\" \\\n--soundstorm-path \"path/to/model\" \\\n--output-dir \"infer_output/base\"\n```\n\nFor example:\n\n```bash\npython anygpt/src/infer/cli_infer_base_model.py \\\n--model-name-or-path models/anygpt/base \\\n--image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \\\n--speech-tokenizer-path models/speechtokenizer/ckpt.dev \\\n--speech-tokenizer-config models/speechtokenizer/config.json \\\n--soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \\\n--output-dir \"infer_output/base\"\n```\n\n#### Interaction\n\nThe Base Model can perform various tasks, including text-to-image, image captioning, automatic speech recognition (ASR), zero-shot text-to-speech (TTS), text-to-music, and music captioning.\n\nWe can perform inference following the instruction formats below.\n\n* Text-to-Image\n  * ``text|image|{caption}``\n  * example:\n    ``text|image|A bustling medieval market scene with vendors selling exotic goods under colorful tents``\n* Image Caption\n  * ``image|text|{image file path}``\n  * example:\n    ``image|text|static/infer/image/cat.jpg``\n* TTS (random voice)\n  * ``text|speech|{speech content}``\n  * example:\n    ``text|speech|I could be bounded in a nutshell and count myself a king of infinite space.``\n* Zero-shot TTS\n  * ``text|speech|{speech content}|{voice prompt}``\n  * example:\n    ``text|speech|I could be bounded in a nutshell and count myself a king of infinite 
space.|static/infer/speech/voice_prompt3.wav``\n* ASR\n  * ``speech|text|{speech file path}``\n  * example: ``speech|text|AnyGPT/static/infer/speech/voice_prompt2.wav``\n* Text-to-Music\n  * ``text|music|{caption}``\n  * example:\n    ``text|music|features an indie rock sound with distinct elements that evoke a dreamy, soothing atmosphere``\n* Music Caption\n  * ``music|text|{music file path}``\n  * example: ``music|text|static/infer/music/features an indie rock sound with distinct element.wav``\n\n**Notes**\n\nFor different tasks, we use different language model decoding strategies. The decoding configuration files for image, speech, and music generation are located in ``config/image_generate_config.json``, ``config/speech_generate_config.json``, and ``config/music_generate_config.json``, respectively. The decoding configuration file for converting other modalities to text is ``config/text_generate_config.json``. You can directly modify or add parameters to change the decoding strategy.\n\nDue to limitations in data and training resources, the model's generation may still be unstable. You can generate multiple times or try different decoding strategies.\n\nSpeech and music responses are saved as ``.wav`` files, and image responses as ``.jpg`` files. The filename is a concatenation of the prompt and the timestamp. 
The paths to these files will be indicated in the response.\n\n### Chat model CLI Inference\n\n```bash\npython anygpt/src/infer/cli_infer_chat_model.py \\\n--model-name-or-path 'path/to/model' \\\n--image-tokenizer-path 'path/to/model' \\\n--speech-tokenizer-path 'path/to/model' \\\n--speech-tokenizer-config 'path/to/config' \\\n--soundstorm-path 'path/to/model' \\\n--output-dir \"infer_output/chat\"\n```\n\nFor example:\n\n```bash\npython anygpt/src/infer/cli_infer_chat_model.py \\\n--model-name-or-path models/anygpt/chat \\\n--image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \\\n--speech-tokenizer-path models/speechtokenizer/ckpt.dev \\\n--speech-tokenizer-config models/speechtokenizer/config.json \\\n--soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \\\n--output-dir \"infer_output/chat\"\n```\n\nInstruction format:\n\n```bash\ninterleaved|{text_instruction}|{modality}|{image_path}|{voice_prompt}|{speech_instruction}|{music_path}\n```\n\nHere ``text_instruction`` is the input text command and ``speech_instruction`` is the input voice command; only one of the two needs to be specified.\n\n``image_path`` and ``music_path`` are the paths for the input image and music, respectively. ``voice_prompt`` specifies the voice of the model's response; if not given, a random voice is used.\n\n``modality`` refers to the type of output modality, which can be speech, image, or music; anything else is treated as text. This only affects which decoding configuration file under the config directory the model uses (the model's training is limited, leading to different decoding strategies for different modalities). 
Decoding can also proceed token by token, switching to the corresponding modality's decoding strategy once that modality's start token is generated.\n\n**Examples**\n\n* interleaved||image|||static/infer/speech/instruction/Can you draw me a picture of a sunny beach.wav\n* interleaved||music|||static/infer/speech/instruction/Give me a similar style of music.wav\n\nTo clear the conversation history, please input ``|clear``.\n\n### Pretraining and SFT\n\nPlease refer to ``scripts/stage1_pretrain.sh`` and ``scripts/stage2_sft.sh``.\n\nWe provide training data samples for reference. The training formats include pre-training data in [data/pretrain](https://github.com/OpenMOSS/AnyGPT/tree/main/data/pretrain) and instruction data in [data/instruction](https://github.com/OpenMOSS/AnyGPT/tree/main/data/instruction).\nFor prompts for different tasks, refer to [task_prompts](https://github.com/OpenMOSS/AnyGPT/blob/16210f829d3b1aa25b0057ebbab0a78057fb59b5/anygpt/src/m_utils/prompter.py#L19), which covers plain-text dialogue, text replies to voice commands, voice replies to text commands, and special prompts for various tasks. 
You need to preprocess multimodal data into multi-turn dialogue format according to the task templates in advance.\nAs an example from the instruction data, we use a voice conversation, which corresponds to the \"Speech-Instruction\" and \"Speech-Response\" entries in task_prompts:\n\n```json\n[\n    {\n        \"role\": \"user\",\n        \"message\": \"\u003csosp\u003e\u003c🗣️1\u003e\u003c🗣️1\u003e\u003c🗣️1\u003e\u003ceosp\u003e Please acknowledge the user's vocal input, create a textual response\"\n    },\n    {\n        \"role\": \"assistant\",\n        \"message\": \"\u003c-Ins-\u003e hello, how are you\\n \u003c-Res-\u003e I am fine, thank you \u003csosp\u003e\u003c🗣️2\u003e\u003c🗣️2\u003e\u003c🗣️2\u003e\u003ceosp\u003e\"\n    }\n]\n```\n\n## Acknowledgements\n\n- [SpeechGPT](https://github.com/0nutation/SpeechGPT/tree/main/speechgpt), [Vicuna](https://github.com/lm-sys/FastChat): The codebases we built upon.\n- We thank the great work of [SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer), [soundstorm-speechtokenizer](https://github.com/ZhangXInFD/soundstorm-speechtokenizer), and [SEED-tokenizer](https://github.com/AILab-CVC/SEED).\n\n## License\n\n`AnyGPT` is released under the original [License](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) of [LLaMA2](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf).\n\n## Citation\n\nIf you find AnyGPT and AnyInstruct useful in your research or applications, please kindly cite:\n\n```\n@article{zhan2024anygpt,\n  title={AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling},\n  author={Zhan, Jun and Dai, Junqi and Ye, Jiasheng and Zhou, Yunhua and Zhang, Dong and Liu, Zhigeng and Zhang, Xin and Yuan, Ruibin and Zhang, Ge and Li, Linyang and others},\n  journal={arXiv preprint arXiv:2402.12226},\n  year={2024}\n}\n```\n","funding_links":[],"categories":["Multimodal Large Models","Building","📂 Methods by Image Processing Type"],"sub_categories":["Web Services_Other","LLM Models","Image-Invariant Text 
Enhancement"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenMOSS%2FAnyGPT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenMOSS%2FAnyGPT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenMOSS%2FAnyGPT/lists"}