{"id":30923659,"url":"https://github.com/amd-agi/gpt-fast","last_synced_at":"2025-09-10T04:38:31.317Z","repository":{"id":291372891,"uuid":"944664408","full_name":"AMD-AGI/gpt-fast","owner":"AMD-AGI","description":"The GPT-Fast for Multimodal Models on AMD GPUs","archived":false,"fork":false,"pushed_at":"2025-08-31T20:18:31.000Z","size":6336,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-08T05:45:30.653Z","etag":null,"topics":["amd","gptfast","inference","llama","llava","multimodal","multimodal-large-language-models","qwen","rocm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AMD-AGI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-03-07T18:44:34.000Z","updated_at":"2025-08-31T20:10:50.000Z","dependencies_parsed_at":"2025-08-31T22:21:08.120Z","dependency_job_id":null,"html_url":"https://github.com/AMD-AGI/gpt-fast","commit_stats":null,"previous_names":["amd-aig-aima/gpt-fast","amd-agi/gpt-fast"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AMD-AGI/gpt-fast","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AMD-AGI%2Fgpt-fast","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AMD-AGI%2Fgpt-fast/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AMD-AGI%2Fgpt-fast/release
s","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AMD-AGI%2Fgpt-fast/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AMD-AGI","download_url":"https://codeload.github.com/AMD-AGI/gpt-fast/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AMD-AGI%2Fgpt-fast/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274412379,"owners_count":25280197,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-10T02:00:12.551Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amd","gptfast","inference","llama","llava","multimodal","multimodal-large-language-models","qwen","rocm"],"created_at":"2025-09-10T04:38:27.108Z","updated_at":"2025-09-10T04:38:31.296Z","avatar_url":"https://github.com/AMD-AGI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multimodal gpt-fast\r\n\r\n![Demo](./media/MMSpecDec.gif)\r\n\r\nThis is a multimodal version of GPT-Fast that adds support for vision-language models, allowing the framework to process both text and images.\r\n\r\nFeaturing:\r\n1. Very low latency\r\n2. \u003c1000 lines of Python\r\n3. No dependencies other than PyTorch and Transformers\r\n4. int8/int4/fp8 quantizations\r\n5. Speculative decoding\r\n6. Tensor parallelism\r\n7. 
Supports AMD GPUs\r\n\r\nThis is NOT intended to be a \"framework\" or \"library\" - it is intended to show off what kind of performance you can get with native PyTorch :) Please copy-paste and fork as you desire.\r\n\r\nFor an in-depth walkthrough of what's in this codebase, see this [blog post](link_to_be_added).\r\n\r\n## Supported Models\r\n\r\n### Text Models\r\n- LLaMA family models (Llama-2, Llama-3, Llama-3.1, Llama-3.2, AMD-Llama) (Example \u003ca href=\"https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e🤗\u003c/a\u003e, \u003ca href=\"https://huggingface.co/amd/AMD-Llama-135m\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e🤗\u003c/a\u003e, ...)\r\n\r\n- Qwen family models (Qwen-2, Qwen-2.5) (Example: \u003ca href=\"https://huggingface.co/Qwen/Qwen2.5-3B-Instruct\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e🤗\u003c/a\u003e, ...)\r\n\r\n\r\n### Multimodal Models\r\nThis version adds support for several vision-language models:\r\n\r\n#### Qwen Vision-Language Models\r\n- Qwen/Qwen-2.5-VL-3B-Instruct \u003ca href=\"https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e🤗\u003c/a\u003e\r\n\r\n- Qwen/Qwen-2.5-VL-7B-Instruct \u003ca href=\"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e🤗\u003c/a\u003e\r\n\r\n- Qwen/Qwen-2.5-VL-72B-Instruct \u003ca href=\"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e🤗\u003c/a\u003e\r\n\r\n\r\n#### Llava One-Vision Models\r\n- lmms-lab/Llava-One-Vision-Qwen2-0.5B-Si \u003ca href=\"https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-si\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e🤗\u003c/a\u003e\r\n\r\n- lmms-lab/Llava-One-Vision-Qwen2-7B-Si \u003ca href=\"https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-si\" target=\"_blank\" rel=\"noopener 
noreferrer\"\u003e🤗\u003c/a\u003e\r\n\r\n- lmms-lab/Llava-One-Vision-Qwen2-72B-Si \u003ca href=\"https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-si\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e🤗\u003c/a\u003e\r\n\r\n\r\n#### Llama-3.2-Vision-Instruct Models\r\n- meta-llama/Llama-3.2-11B-Vision-Instruct \u003ca href=\"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e🤗\u003c/a\u003e\r\n\r\n- meta-llama/Llama-3.2-90B-Vision-Instruct \u003ca href=\"https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e🤗\u003c/a\u003e\r\n\r\n\r\n\r\n## Getting Started\r\n### Installation\r\nFirst, install [PyTorch](http://pytorch.org/) according to the instructions specific to your operating system. For AMD GPUs, we strongly recommend using ROCm Docker images such as [rocm/pytorch](https://hub.docker.com/r/rocm/pytorch).\r\nThen install the required packages with the command below, which avoids reinstalling Torch from scratch.\r\n```bash\r\npip install -r requirements.txt -c constraints.txt\r\n```\r\n\r\n### Download and Convert Model Weights\r\n\r\nTo download and convert any of the models listed under Supported Models above, use the following command:\r\n```bash\r\nbash scripts/prepare.sh \u003cHF_model/repo_id\u003e \u003cdownload_dir\u003e\r\n```\r\nwhere `\u003cHF_model/repo_id\u003e` is the model ID from the [HuggingFace](https://huggingface.co/) website. The script downloads the model weights from HuggingFace and then converts them to the format supported by this GPTFast repo. For gated models, you will need to add your HuggingFace token to the environment. 
If you have not done that, you can use this command:\r\n```bash\r\nhuggingface-cli login\r\n```\r\n### Optional: Quantize Model Weights\r\nTo save memory and potentially improve performance, you can quantize models to int8, int4, or fp8:\r\n\r\n```bash\r\npython quantize.py --checkpoint_path \u003cdownload_dir\u003e/\u003cHF_model/repo_id\u003e/model.pth --mode int8\r\n```\r\nYou can also apply quantization directly when preparing models by adding the quantization mode as a third parameter:\r\n```bash\r\nbash scripts/prepare.sh \u003cHF_model/repo_id\u003e \u003cdownload_dir\u003e int8\r\n```\r\n\r\n### Run inference\r\n\r\n#### Benchmarking\r\nTo run vanilla decoding benchmarks, use the `evaluate.py` script as shown below:\r\n\r\n```bash\r\npython evaluate.py --bench_name MMMU --checkpoint_path \u003cdownload_dir\u003e/\u003cHF_model/repo_id\u003e/model.pth\r\n```\r\n\r\nTo run speculative decoding, add the draft model's arguments as shown below:\r\n\r\n```bash\r\npython evaluate.py --bench_name MMMU --checkpoint_path \u003cdownload_dir\u003e/\u003cHF_model_target/repo_id\u003e/model.pth --draft_checkpoint_path \u003cdownload_dir\u003e/\u003cHF_model_draft/repo_id\u003e/model.pth --speculate_k \u003cnum_of_draft_tokens\u003e\r\n```\r\n- To compile the model forward passes using `torch.compile()`, you can use the `--compile` flag. 
Since compilation benefits from a fixed-length kv-cache, it is recommended to set a cache size large enough for both the target and the draft models using the `--max_cache_size` and `--draft_max_cache_size` arguments, as below:\r\n\r\n```bash\r\npython evaluate.py --bench_name MMMU --checkpoint_path \u003cdownload_dir\u003e/\u003cHF_model_target/repo_id\u003e/model.pth --draft_checkpoint_path \u003cdownload_dir\u003e/\u003cHF_model_draft/repo_id\u003e/model.pth --speculate_k \u003cnum_of_draft_tokens\u003e --compile --max_cache_size \u003ctarget_model_cache_size\u003e --draft_max_cache_size \u003cdraft_model_cache_size\u003e\r\n```\r\n- For the Llama 3.2 vision models, it is also recommended to set `--cross_attention_seq_length` to fix the kv-cache size of the cross-attention layers.\r\n\r\n- To leverage the draft model’s visual token compression for faster speculative decoding, you can use `--mm_prune_method='random'` or `--mm_prune_method='structured'` along with `--mm_prune_ratio=\u003cprune_ratio\u003e`.\r\n\r\n- For speculative decoding on very large models such as Llama 3.2 90B, you can place the drafter on a separate GPU with the `--draft_device` argument.\r\n\r\n- To use the Tensor Parallel distributed strategy for large multimodal models, you can use the following command. Note that models such as Qwen 0.5B/7B and Llava 0.5B/7B may not work with this approach on 8 GPUs, as their attention sizes are not evenly divisible by 8.\r\n\r\n```bash\r\nENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=\u003cnum_gpus\u003e evaluate.py --bench_name MMMU --checkpoint_path \u003cdownload_dir\u003e/\u003cHF_model_target/repo_id\u003e/model.pth --draft_checkpoint_path \u003cdownload_dir\u003e/\u003cHF_model_draft/repo_id\u003e/model.pth --speculate_k \u003cnum_of_draft_tokens\u003e\r\n```\r\n\r\n#### Interactive Text Generation with Web UI\r\nYou can interact with the model through a Gradio app. 
If you have not installed the Gradio library, you can install it using the command below:\r\n\r\n```bash\r\npip install gradio\r\n```\r\n\r\nNow you can run the app with the following command:\r\n```bash\r\npython app.py --checkpoint_path \u003cdownload_dir\u003e/\u003cHF_model/repo_id\u003e/model.pth\r\n```\r\n\r\nTo use speculative decoding, add the following arguments:\r\n\r\n```bash\r\npython app.py --checkpoint_path \u003cdownload_dir\u003e/\u003cHF_model/repo_id\u003e/model.pth --speculate_k \u003c#_of_draft_tokens\u003e\r\n```\r\n\r\nThe web UI automatically detects if your model is multimodal and displays an image upload interface if it is. You can:\r\n- Upload images\r\n- Adjust temperature and other sampling parameters\r\n- Toggle speculative decoding on/off\r\n- Stream generated text in real-time\r\n\r\n## License\r\n\r\n`AMD Multimodal gpt-fast` is released under the same license as the original GPTFast, [BSD 3](https://github.com/pytorch-labs/gpt-fast/main/LICENSE) license.\r\n\r\n## Acknowledgements\r\nThis project builds upon the original GPT-Fast by the PyTorch team and extends it with multimodal capabilities.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famd-agi%2Fgpt-fast","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famd-agi%2Fgpt-fast","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famd-agi%2Fgpt-fast/lists"}