{"id":17956949,"url":"https://github.com/fminference/flexllmgen","last_synced_at":"2025-07-27T21:31:42.337Z","repository":{"id":67653083,"uuid":"602270517","full_name":"FMInference/FlexLLMGen","owner":"FMInference","description":"Running large language models on a single GPU for throughput-oriented scenarios.","archived":false,"fork":false,"pushed_at":"2024-10-28T03:05:41.000Z","size":38932,"stargazers_count":9206,"open_issues_count":57,"forks_count":548,"subscribers_count":111,"default_branch":"main","last_synced_at":"2024-11-25T05:04:23.239Z","etag":null,"topics":["deep-learning","gpt-3","high-throughput","large-language-models","machine-learning","offloading","opt"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FMInference.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-15T21:18:53.000Z","updated_at":"2024-11-25T03:26:22.000Z","dependencies_parsed_at":"2024-04-19T20:37:40.908Z","dependency_job_id":"1fee7e94-da7a-48df-85ae-12e02a45f234","html_url":"https://github.com/FMInference/FlexLLMGen","commit_stats":{"total_commits":94,"total_committers":17,"mean_commits":5.529411764705882,"dds":0.4787234042553191,"last_synced_commit":"3834bb3eba206f5142ce555b44ad4979617eb989"},"previous_names":["fminference/flexigen","fminference/flexgen","fminference/flexllmgen","ying1123/flexgen"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FMInference%2FFlexLLMGen","tags_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/FMInference%2FFlexLLMGen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FMInference%2FFlexLLMGen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FMInference%2FFlexLLMGen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FMInference","download_url":"https://codeload.github.com/FMInference/FlexLLMGen/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226422988,"owners_count":17622610,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","gpt-3","high-throughput","large-language-models","machine-learning","offloading","opt"],"created_at":"2024-10-29T10:48:01.885Z","updated_at":"2024-11-26T01:02:19.863Z","avatar_url":"https://github.com/FMInference.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FlexLLMGen: High-throughput Generative Inference of Large Language Models with a Single GPU [[paper](https://arxiv.org/abs/2303.06865)]\n\nFlexLLMGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexLLMGen allows **high-throughput** generation by IO-efficient offloading, compression, and **large effective batch sizes**.\n\n## Motivation\n\nIn recent years, large language models (LLMs) have shown great performance across a \nwide range of tasks. 
Increasingly, LLMs have been applied not only to interactive \napplications (such as chat), but also to many \"back-of-house\" tasks.\nThese tasks include benchmarking, information extraction, data wrangling, and form processing.\n\nOne key characteristic of these applications is that they are **throughput-oriented**: they require\nrunning LLM inferences over millions of tokens in batches, e.g., all the private documents in a company's\ncorpus, or all the tasks in the [HELM](https://crfm.stanford.edu/helm/latest/) benchmark.\nThese workloads are less sensitive to latency - the user starts up a job and lets it run overnight -\nbut increasing throughput is critical for reducing costs.\nThroughput is a measure of tokens processed per second over the job's entire runtime (which can be hours).\nThroughput-oriented workloads provide opportunities to trade off latency for higher throughput, which\nmakes it easier to take advantage of low-cost commodity GPUs. \n\nThe goal of FlexLLMGen is to create a high-throughput system to enable new and exciting applications of \nfoundation models to throughput-oriented tasks on low-cost hardware, such as a single commodity GPU\ninstead of expensive systems.\n\nCheck out the [examples](#usage-and-examples) of what you can run on a single commodity GPU with FlexLLMGen, including benchmarking and data wrangling.\n\n❌ **Limitation**. 
As an offloading-based system running on weak GPUs, FlexLLMGen also has its limitations.\nFlexLLMGen can be significantly slower than running on GPUs powerful enough to hold the whole model, especially for small-batch cases.\nFlexLLMGen is mostly optimized for throughput-oriented batch processing settings (e.g., classifying or extracting information from many documents in batches) on a single GPU.\n\n----------\n\nThis project was made possible thanks to a collaboration with\n\n\u003ca href=\"https://cs.stanford.edu/\"\u003e\u003cimg src=\"https://identity.stanford.edu/wp-content/uploads/sites/3/2020/06/wordmark-nospace-red.png\" height=\"20\"\u003e\u003c/a\u003e \u0026nbsp;\u0026nbsp;\u0026nbsp;\n\u003ca href=\"https://sky.cs.berkeley.edu/\"\u003e\u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/8/82/University_of_California%2C_Berkeley_logo.svg/1280px-University_of_California%2C_Berkeley_logo.svg.png\" height=\"22\"\u003e\u003c/a\u003e \u0026nbsp;\u0026nbsp;\u0026nbsp;\n\u003ca href=\"https://www.andrew.cmu.edu/user/beidic/\"\u003e\u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/9/9b/Carnegie_Mellon_wordmark.svg\" height=\"20\"\u003e\u003c/a\u003e \u0026nbsp;\u0026nbsp;\u0026nbsp;\n\u003ca href=\"https://www.together.xyz/\"\u003e\u003cimg src=\"https://images.squarespace-cdn.com/content/v1/6358bea282189a0adf57fe16/eef09191-631f-40d9-9bfd-f875b25bcf0b/together-logo-black-transparent2.png\" height=\"20\"\u003e\u003c/a\u003e \u0026nbsp;\u0026nbsp;\u0026nbsp;\n\u003ca href=\"https://research.yandex.com/\"\u003e\u003cimg src=\"https://storage.yandexcloud.net/yandex-research/assets/yandex_research.png\" height=\"20\"\u003e\u003c/a\u003e \u0026nbsp;\u0026nbsp;\u0026nbsp;\n\u003ca href=\"https://ds3lab.inf.ethz.ch/\"\u003e\u003cimg src=\"https://user-images.githubusercontent.com/1608867/220273382-c09669b3-42fd-47c2-b88c-7ed55cb43820.png\" height=\"20\"\u003e\u003c/a\u003e\n\n----------\n\n## Content\n- 
[Installation](#installation)\n- [Usage and Examples](#usage-and-examples)\n  - [Get Started with a Single GPU](#get-started-with-a-single-gpu)\n  - [Run HELM Benchmark with FlexLLMGen](#run-helm-benchmark-with-flexllmgen)\n  - [Run Data Wrangling Tasks with FlexLLMGen](#run-data-wrangling-tasks-with-flexllmgen)\n  - [Scaling to Distributed GPUs](#scaling-to-distributed-gpus)\n  - [API Example](#api-example)\n  - [Frequently Asked Questions](#frequently-asked-questions)\n- [Performance Results](#performance-results)\n- [How It Works](#how-it-works)\n- [Roadmap](#roadmap)\n\n## Installation\nRequirements:  \n - PyTorch \u003e= 1.12 [(Help)](https://pytorch.org/get-started/locally/)\n\n### Method 1: With pip\n```\npip install flexllmgen\n```\n\n### Method 2: From source\n```\ngit clone https://github.com/FMInference/FlexLLMGen.git\ncd FlexLLMGen\npip install -e .\n```\n\n## Usage and Examples\n\n### Get Started with a Single GPU\n\n#### OPT-1.3B\nTo get started, you can try a small model like OPT-1.3B first. It fits into a single GPU so no offloading is required.\nFlexLLMGen will automatically download weights from Hugging Face.\n```\npython3 -m flexllmgen.flex_opt --model facebook/opt-1.3b\n```\n\nYou should see some text generated by OPT-1.3B and the benchmark results.\n\n#### OPT-30B\nTo run large models like OPT-30B, you will need to use CPU offloading. 
You can try the command below.\nThe `--percent` argument specifies the offloading strategy for parameters, attention cache, and hidden states separately.\nThe exact meaning of this argument can be found [here](https://github.com/FMInference/FlexLLMGen/blob/9d092d848f106cd9eaf305c12ef3590f7bcb0277/flexllmgen/flex_opt.py#L1271-L1279).\n```\npython3 -m flexllmgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0\n```\n\n#### OPT-175B\nTo run OPT-175B, you need to download the weights from [metaseq](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT) and convert the weights into Alpa [format](https://alpa.ai/tutorials/opt_serving.html#convert-opt-175b-weights-into-alpa-formats).\nYou can then try offloading all weights to disk with\n```\npython3 -m flexllmgen.flex_opt --model facebook/opt-175b --percent 0 0 100 0 100 0 --offload-dir YOUR_SSD_FOLDER\n```\n\n### Run HELM Benchmark with FlexLLMGen\nFlexLLMGen can be integrated into [HELM](https://crfm.stanford.edu/helm), a language model benchmark framework, as its execution backend.\nYou can use the commands below to run a Massive Multitask Language Understanding (MMLU) [scenario](https://crfm.stanford.edu/helm/latest/?group=mmlu) with a single T4 (16GB) GPU and 200GB of DRAM.\n```\npip install crfm-helm\npython3 -m flexllmgen.apps.helm_run --description mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical --pad-to-seq-len 512 --model facebook/opt-30b --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3 --max-eval-instance 100\n```\nNote that only a subset of HELM scenarios is tested. 
See more tested scenarios [here](flexllmgen/apps/helm_passed_30b.sh).\n\n### Run Data Wrangling Tasks with FlexLLMGen\nYou can run the examples in this paper, ['Can Foundation Models Wrangle Your Data?'](https://arxiv.org/abs/2205.09911), by following the instructions [here](flexllmgen/apps/data_wrangle).\n\n### Scaling to Distributed GPUs\nIf you have multiple machines with GPUs, FlexLLMGen can combine offloading with pipeline parallelism to allow scaling.\nFor example, if you have 2 GPUs but the aggregated GPU memory is less than the model size, you still need offloading. FlexLLMGen allows you to use pipeline parallelism across these 2 GPUs to accelerate generation.\nTo get scaled performance, however, you should have GPUs on distributed machines.\nSee examples [here](https://github.com/FMInference/FlexLLMGen/tree/main/benchmark/flexllmgen#distributed-gpus).\n\n### API Example\nWe demonstrate the usage of the FlexLLMGen API in [completion.py](flexllmgen/apps/completion.py).\nThis example shows how to run generation for two sentences.\nTo get the best throughput out of FlexLLMGen, you typically need to batch more sentences.\n\n#### Generation API\nFlexLLMGen has a generation API following the style of Hugging Face's transformers.\n```python\noutput_ids = model.generate(\n\tinput_ids,\n\tdo_sample=True,\n\ttemperature=0.7,\n\tmax_new_tokens=32,\n\tstop=stop)\n```\n\n#### Example Commands\nYou can use the example commands below.\nIf you do not have enough GPU/CPU memory, see the [How to handle out-of-memory?](#how-to-handle-out-of-memory) section.\n\n```\n# Complete with OPT-6.7B. You need at least 15GB of GPU memory.\npython3 -m flexllmgen.apps.completion --model facebook/opt-6.7b\n```\n\n```\n# Complete with OPT-30B. You need about 90GB of CPU memory.\npython3 -m flexllmgen.apps.completion --model facebook/opt-30b --percent 0 100 100 0 100 0\n```\n\n```\n# Complete with instruction-tuned OPT-IML-MAX-30B. 
You need about 90GB of CPU memory.\npython3 -m flexllmgen.apps.completion --model facebook/opt-iml-max-30b --percent 0 100 100 0 100 0\n```\n\n### Frequently Asked Questions\n\n#### How to set the offloading strategy and `--percent`?\nWe will release an automatic policy optimizer later, but now you have to manually try a few strategies.\nThe idea of high-throughput generation is to offload parameters and attention cache as much as possible to the CPU and disk if necessary.\nYou can see the reference strategies in our benchmark [here](https://github.com/FMInference/FlexLLMGen/blob/9d092d848f106cd9eaf305c12ef3590f7bcb0277/benchmark/flexllmgen/bench_suite.py#L39-L79).\nTo avoid out-of-memory, you can tune the `--percent` to offload more tensors to the CPU and disk.\n\n\n#### How to handle out-of-memory?\nIf you do not have enough GPU/CPU memory, here are a few things you can try.\nThey save more memory but run slower.\n\n- Do not pin weights by adding `--pin-weight 0`. This can reduce the weight memory usage on CPU by around 20% or more.\n- Enable weight compression by adding `--compress-weight`. This can reduce the weight memory usage by around 70%.\n- Offload all weights to disk by using `--percent 0 0 100 0 100 0`. This requires very little CPU and GPU memory.\n\n## Performance Results\n### Generation Throughput (token/s)\nThe corresponding effective batch sizes and lowest offloading devices are in parentheses. 
Please see [here](benchmark/batch_size_table.md) for more details.\n| System | OPT-6.7B | OPT-30B | OPT-175B |\n| ------ | -------- | ------- | -------- |\n| Hugging Face Accelerate  | 25.12 (2 on GPU)  | 0.62 (8 on CPU) | 0.01 (2 on disk) |\n| DeepSpeed ZeRO-Inference | 9.28 (16 on CPU)  | 0.60 (4 on CPU) | 0.01 (1 on disk) |\n| Petals                 | 8.25 (2 on GPU) | 2.84 (2 on GPU) | 0.08 (2 on GPU) |\n| FlexLLMGen                  | 25.26 (2 on GPU) | 7.32 (144 on CPU) | 0.69 (256 on disk) |\n| FlexLLMGen with Compression | **29.12** (72 on GPU) | **8.38** (512 on CPU) | **1.12** (144 on CPU) |\n\n- Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.  \n- Workload: input sequence length = 512, output sequence length = 32. The batch size is tuned to **a large value** that maximizes the generation throughput for each system.\n- Metric: generation throughput (token/s) = number of the generated tokens / (time for processing prompts + time for generation).  \n\nHow to [reproduce](benchmark/flexllmgen).\n\n### Latency-Throughput Trade-Off\nThe figure below shows the latency and throughput trade-off of three offloading-based systems on OPT-175B (left) and OPT-30B (right).\nFlexLLMGen achieves a new Pareto-optimal frontier with significantly higher maximum throughput for both models.\nOther systems cannot further increase throughput due to out-of-memory.\n\"FlexLLMGen(c)\" is FlexLLMGen with compression.\n\n\u003cimg src=\"https://github.com/FMInference/FlexLLMGen/blob/main/docs/throughput_vs_latency.jpg\" alt=\"image\" width=\"500\"\u003e\u003c/img\u003e\n\n## How It Works\nFlexLLMGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for the best pattern to store and access the tensors, including weights, activations, and attention key/value (KV) cache. 
FlexLLMGen further compresses both weights and KV cache to 4 bits with negligible accuracy loss.\n\nOne key idea of FlexLLMGen is to exploit the latency-throughput trade-off. Achieving low latency is inherently challenging for offloading methods,\nbut the I/O efficiency of offloading can be greatly boosted for throughput-oriented scenarios (see the figure above).\nFlexLLMGen utilizes a block schedule to reuse weights and overlap I/O with computation, as shown in figure (b) below, while other baseline systems use an inefficient row-by-row schedule, as shown in figure (a) below.\n\n\u003cimg src=\"https://github.com/FMInference/FlexLLMGen/raw/main/docs/block_schedule.jpg\" alt=\"image\" width=\"500\"\u003e\u003c/img\u003e\n\nFor more technical details, see our [paper](https://arxiv.org/abs/2303.06865).\n\n## Roadmap\nWe plan to work on the following features.\n\n- [ ] Optimize the performance for multiple GPUs on the same machine\n- [ ] Support more models (BLOOM, CodeGen, GLM)\n- [X] Release the cost model and policy optimizer\n- [ ] Macbook Support (M1 and M2)\n- [ ] AMD Support\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffminference%2Fflexllmgen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffminference%2Fflexllmgen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffminference%2Fflexllmgen/lists"}