{"id":14964620,"url":"https://github.com/rahulschand/gpu_poor","last_synced_at":"2025-05-14T20:08:54.745Z","repository":{"id":194299461,"uuid":"690511522","full_name":"RahulSChand/gpu_poor","owner":"RahulSChand","description":"Calculate token/s \u0026 GPU memory requirement for any LLM.  Supports llama.cpp/ggml/bnb/QLoRA quantization","archived":false,"fork":false,"pushed_at":"2024-12-03T20:52:37.000Z","size":1638,"stargazers_count":1303,"open_issues_count":7,"forks_count":73,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-05-11T01:37:03.449Z","etag":null,"topics":["ggml","gpu","huggingface","language-model","llama","llama2","llamacpp","llm","pytorch","quantization"],"latest_commit_sha":null,"homepage":"https://rahulschand.github.io/gpu_poor/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RahulSChand.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-12T10:41:29.000Z","updated_at":"2025-05-10T09:37:12.000Z","dependencies_parsed_at":"2023-09-12T19:47:01.322Z","dependency_job_id":"bbe34d2f-e671-4cdf-9bcc-3759cfe51e9b","html_url":"https://github.com/RahulSChand/gpu_poor","commit_stats":{"total_commits":59,"total_committers":3,"mean_commits":"19.666666666666668","dds":"0.27118644067796616","last_synced_commit":"a702c37a113baaaaa181bc5319255c9292f99240"},"previous_names":["rahulschand/gpu_poor"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RahulSChand%2Fgpu_poor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RahulSChand%2Fgpu_poor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RahulSChand%2Fgpu_poor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RahulSChand%2Fgpu_poor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RahulSChand","download_url":"https://codeload.github.com/RahulSChand/gpu_poor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254219373,"owners_count":22034397,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ggml","gpu","huggingface","language-model","llama","llama2","llamacpp","llm","pytorch","quantization"],"created_at":"2024-09-24T13:33:31.305Z","updated_at":"2025-05-14T20:08:54.717Z","avatar_url":"https://github.com/RahulSChand.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Can my GPU run this LLM? 

![Made with](https://img.shields.io/badge/logo-javascript-blue?logo=javascript)

Calculates how much **GPU memory you need** and how many **tokens/s you can get** for any LLM & GPU/CPU.

Also gives a breakdown of where the memory goes for training/inference. Quantization (GGML/bitsandbytes/QLoRA) & inference frameworks (vLLM/llama.cpp/HF) are supported.

Link: **https://rahulschand.github.io/gpu_poor/**

### Demo

![new_upload](https://github.com/RahulSChand/gpu_poor/assets/16897807/14250f55-e886-4cc6-9aeb-08532382860c)

---

## Use cases/Features

#### 1. Calculate vRAM memory requirement 💾

<img width="643" alt="image" src="https://github.com/RahulSChand/gpu_poor/assets/16897807/29577394-0efd-42fb-aaf4-282e9a45d5db">

---

#### 2. Calculate ~token/s you can get ⏱️

<img width="647" alt="image" src="https://github.com/RahulSChand/gpu_poor/assets/16897807/77627c9b-5fdd-44cf-8b7d-452ff0563a8a">

---

#### 3. Approximate time for finetuning (ms per iteration) ⌛️

<img width="764" alt="image" src="https://github.com/RahulSChand/gpu_poor/assets/16897807/e5fd08a1-abb9-4e00-ad45-ba9bb15ec546">

---

For memory, the output is the total vRAM & its breakdown. It looks like this:

```
{
  "Total": 4000,
  "KV Cache": 1000,
  "Model Size": 2000,
  "Activation Memory": 500,
  "Grad & Optimizer memory": 0,
  "cuda + other overhead": 500
}
```

For token/s, the additional info looks like this:

```
{
  "Token per second": 50,
  "ms per token": 20,
  "Prompt process time (s)": 5,
  "memory or compute bound?": "Memory"
}
```

For training, the output is the time for each iteration (forward + backward, in ms):

```
{
  "ms per iteration (forward + backward)": 100,
  "memory or compute bound?": "Memory"
}
```

---

### Purpose

I made this to check whether you can run a particular LLM on your GPU. It is useful for figuring out the following:

1. How many tokens/s can I get?
2. How much total time will finetuning take?
3. What quantization will fit on my GPU?
4. What max context length & batch size can my GPU handle?
5. Which finetuning? Full? LoRA? QLoRA?
6. What is consuming my GPU memory? What should I change to fit the LLM on the GPU?

---

## Additional info + FAQ

### Can't we just look at the model size & figure this out?

Finding which LLMs your GPU can handle isn't as easy as looking at the model size, because during inference the KV cache takes a substantial amount of extra memory. For example, with sequence length 1000 on llama-2-7b it takes 1 GB of extra memory (using huggingface LlamaForCausalLM; with exLlama & vLLM it is 500 MB). During training, KV cache, activations & quantization overhead all take a lot of memory. For example, llama-7b with bnb int8 quant is ~7.5 GB in size, but it isn't possible to finetune it using LoRA on data with 1000 context length even with an RTX 4090 24 GB, which means an additional 16 GB of memory goes into quant overheads, activations & grad memory.

### How reliable are the numbers?
The results can vary depending on your model, input data, CUDA version & which quant you are using, so it is impossible to predict exact values. I have tried to take these factors into account & make sure the results are within 500 MB. In the table below I cross-check the 3b, 7b & 13b model memories given by the website vs. what I get on my RTX 4090 & 2060 GPUs. All values are within 500 MB.

<img width="604" alt="image" src="https://github.com/RahulSChand/gpu_poor/assets/16897807/3d49a422-f174-4537-b5fa-42adc4b15a89">

### How are the values calculated?

`Total memory = model size + kv-cache + activation memory + optimizer/grad memory + cuda etc. overhead`
1. **Model size** = this is your `.bin` file size (divide it by 2 for Q8 quant & by 4 for Q4 quant).
2. **KV-Cache** = memory taken by the KV (key-value) vectors. Size = `(2 x sequence length x hidden size)` _per layer_. For huggingface this is `(2 x 2 x sequence length x hidden size)` _per layer_. In training the whole sequence is processed at once (therefore KV cache memory = 0).
3. **Activation Memory** = in the forward pass, every operation's output has to be stored for doing `.backward()`. For example, if you do `output = Q * input` where `Q = (dim, dim)` and `input = (batch, seq, dim)`, then the output of shape `(batch, seq, dim)` needs to be stored (in fp16). This consumes the most memory in LoRA/QLoRA. In LLMs there are many such intermediate steps (after Q, K, V, after attention, after norm, after FFN1, FFN2, FFN3, after the skip connection, ...); around 15 intermediate representations are saved _per layer_.
4. **Optimizer/Grad memory** = memory taken by `.grad` tensors & tensors associated with the optimizer (`running avg` etc.).
5. **Cuda etc. overhead** = around 500 MB-1 GB of memory is taken by CUDA whenever CUDA is loaded. There are also additional overheads when you use any quantization (like bitsandbytes). There is no straightforward formula here (I assume a 650 MB CUDA overhead in my calculations).
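
To make the formula concrete, below is a minimal JavaScript sketch of the inference side of this calculation (activation memory is omitted, and the function names & llama-2-7b hyper-parameters are illustrative assumptions; this is not the website's actual code):

```javascript
// Minimal sketch of: Total memory = model size + kv-cache + cuda overhead
// Assumes fp16 values (2 bytes each) and the huggingface KV-cache layout.
const MB = 1024 ** 2;

// 1. Model size: parameter count x bytes per weight (2 = fp16, 1 = Q8, 0.5 = Q4)
function modelSizeMB(numParams, bytesPerWeight) {
  return (numParams * bytesPerWeight) / MB;
}

// 2. KV cache (huggingface): (2 x 2 x seqLen x hiddenSize) values per layer
function kvCacheMB(seqLen, hiddenSize, numLayers) {
  const valuesPerLayer = 2 * 2 * seqLen * hiddenSize;
  return (valuesPerLayer * 2 * numLayers) / MB; // x2 bytes per fp16 value
}

// Total for inference: optimizer/grad memory = 0; the 650 MB CUDA overhead
// is the value assumed above.
function inferenceMemoryMB(numParams, bytesPerWeight, seqLen, hiddenSize, numLayers) {
  const cudaOverheadMB = 650;
  return (
    modelSizeMB(numParams, bytesPerWeight) +
    kvCacheMB(seqLen, hiddenSize, numLayers) +
    cudaOverheadMB
  );
}

// Example: llama-2-7b (6.74B params, 32 layers, hidden size 4096), fp16, seq len 1000
console.log(Math.round(inferenceMemoryMB(6.74e9, 2, 1000, 4096, 32)), "MB");
```

The KV-cache term alone works out to `2 x 2 x 1000 x 4096` values per layer x 2 bytes x 32 layers ≈ 1000 MB, matching the ~1 GB huggingface figure quoted in the FAQ above.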

### Why are the results wrong?
Sometimes the answers might be very wrong, in which case please open an issue here & I will try to fix it.

---

### TODO
1. Add support for vLLM for token/s
2. ~Add QLoRA~ ✅
3. ~Add a way to measure the approximate tokens/s you can get for a particular GPU~ ✅
4. ~Improve logic to get hyper-params from size~ (since hidden size/intermediate size/number of layers can vary for a particular size) ✅
5. Add AWQ
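
---

As a rough sanity check on the token/s numbers: in the memory-bound case, every generated token has to stream all the model weights from vRAM once, so tokens/s is bounded by `memory bandwidth / model size`. The sketch below is that generic rule of thumb, not necessarily the exact method the site uses:

```javascript
// Back-of-the-envelope upper bound for memory-bound generation.
// Assumption (a generic rule of thumb, not necessarily the site's exact method):
// each generated token reads every weight once from vRAM.
function memoryBoundTokensPerSec(bandwidthGBps, modelSizeGB) {
  return bandwidthGBps / modelSizeGB;
}

// Example: llama-2-7b in fp16 (~13.5 GB) on an RTX 4090 (~1008 GB/s bandwidth)
console.log(Math.round(memoryBoundTokensPerSec(1008, 13.5)), "tokens/s (upper bound)");
```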