{"id":23780908,"url":"https://github.com/ajithvcoder/emlo4-session-09-ajithvcoder","last_synced_at":"2025-10-14T20:33:35.698Z","repository":{"id":263151874,"uuid":"889479282","full_name":"ajithvcoder/emlo4-session-09-ajithvcoder","owner":"ajithvcoder","description":"Deploying a Vision model with LitServe and a LLM - llama3.2 model with litserve  from The School of AI EMLO-V4 course assignment https://theschoolof.ai/#programs","archived":false,"fork":false,"pushed_at":"2024-11-19T03:50:29.000Z","size":1910,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-02T07:11:16.260Z","etag":null,"topics":["api-development","litserve","llama-3-2-1b","llama3-2","llm-api-development","lora","peft-fine-tuning-llm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ajithvcoder.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-16T12:58:59.000Z","updated_at":"2024-11-28T08:35:48.000Z","dependencies_parsed_at":"2025-06-01T22:06:43.754Z","dependency_job_id":"dcfcc5fe-da8d-472c-bce9-48319c50ea61","html_url":"https://github.com/ajithvcoder/emlo4-session-09-ajithvcoder","commit_stats":null,"previous_names":["ajithvcoder/emlo4-session-09-ajithvcoder"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ajithvcoder/emlo4-session-09-ajithvcoder","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Femlo4-session-09-ajithvcoder","tags_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Femlo4-session-09-ajithvcoder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Femlo4-session-09-ajithvcoder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Femlo4-session-09-ajithvcoder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ajithvcoder","download_url":"https://codeload.github.com/ajithvcoder/emlo4-session-09-ajithvcoder/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Femlo4-session-09-ajithvcoder/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279020903,"owners_count":26086948,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api-development","litserve","llama-3-2-1b","llama3-2","llm-api-development","lora","peft-fine-tuning-llm"],"created_at":"2025-01-01T11:18:17.733Z","updated_at":"2025-10-14T20:33:35.677Z","avatar_url":"https://github.com/ajithvcoder.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## EMLOV4-Session-09 Assignment - Deployment with LitServe\n\n**Note** : I have completed the bonus task of optimization with **torch-ao** by doing **4-bit 
quantization, attention, static cache, max-autotune**, and I have also done **PEFT, LoRA and attention techniques** for 4-bit optimization as a separate experiment. The TorchAO method achieved about **56% higher throughput** than the PEFT-LoRA optimization.\n\n### Contents\n\n- [Requirements](#requirements)\n- [Development Command and Debug Commands](#development-command-and-debug-commands)\n    - [Task-1-LitServe-Cat-Dog-dataset](#task-1-litserve-cat-dog-dataset)\n        - [Theoretical Calculation](#theoretical-calculation)\n        - [Task-1-Experiment 1](#task-1-experiment-1)\n        - [Task-1-Experiment 2](#task-1-experiment-2)\n        - [Task-1-Experiment 3](#task-1-experiment-3)\n        - [Task-1-Experiment 4](#task-1-experiment-4)\n        - [Task-1-Experiment 5](#task-1-experiment-5)\n        - [Task-1-Experiment 6](#task-1-experiment-6)\n    - [Task-2 Deploy any llama based llm with LitServe](#task-2-deploy-any-llama-based-llm-with-litserve)\n        - [Task-2-Experiment 1](#task-2-experiment-1)\n        - [Task-2-Experiment 2](#task-2-experiment-2)\n        - [Task-2-Experiment 3](#task-2-experiment-3)\n        - [Task-2-Experiment 4](#task-2-experiment-4)\n- [Learnings](#learnings)\n- [Results](#results)\n\n### Requirements\n\n- Deploy the Cat-Dog or Dog-Breed Classifier with LitServe and benchmark the server performance.\n    - Find bottlenecks and optimize incrementally\n    - Document everything in your README with plots for comparison with the theoretical maximum throughput\n    - Any instance with GPU\n\n- Deploy any llama based llm\n    - Benchmark tokens/sec and compare with theoretical max\n    - No Batching\n    - BONUS: Use TorchAO to further push the model: https://huggingface.co/docs/transformers/main/en/quantization/torchao\n    - BONUS: You can use further optimizations for LLMs from here: https://huggingface.co/docs/transformers/main/en/llm_optims\n\n### Development Command and Debug Commands\n\n**EC2 Instance - VS Code Desktop Connection**\n\n- 
Generate a key on your local machine in the folder \"C:\\Users\\Ajith\\.ssh\u003e\" by running `ssh-keygen -t rsa -b 4096`\n- You will have an `id_rsa.pem` file and an `id_rsa` file\n- Open `~/.ssh/authorized_keys` on the EC2 instance and paste the .pem file content\n- Change the .config file as below\n\n    ![config](./assets/snap_id_rsa.png)\n\n- Now go to VS Code -\u003e Ctrl+Shift+P -\u003e Connect current window to remote host -\u003e choose the IP address you want to connect to\n\n**Docker commands**\n\n```\n# Build image\ndocker build -t cat_dog_image .\n\n# Create container with gpu option\ndocker run -d --gpus=all -v /home/ubuntu/dev/emlo4-session-09-ajithvcoder:/workspace  cat_dog_image\n\n# Open an interactive prompt in the container for development and debugging\n# (fa30d is the container id prefix; replace it with your own)\ndocker exec -it fa30d   /bin/bash \n```\n\n**Push AMI to AWS Private AMI location**\n\n```\n# Configure with your access key and secret\naws configure\n\n# Run from your own instance: fetches the instance-id and pushes the AMI to the private AMI location\naws ec2 create-image \\\n    --instance-id $(curl -s http://169.254.169.254/latest/meta-data/instance-id) \\\n    --name \"Session-09-ami-Nov-19-1\" \\\n    --description \"AMI created programmatically from this instance\" \\\n    --no-reboot\n\n# You will get an AMI id like this: ami-0af5900df6f0bfaf4\n```\n\n#### Build Command\n\n**GPU Usage**\n\n- Pass the cuda parameter to the trainer so that it trains on the GPU\n- You need to pass `--gpus=all` to the docker run command so that it uses the host GPU\n\n**Debug Commands for development**\n\n\n**Install**\n\n```export UV_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cpu```\n\nOR \n\n```uv sync --extra-index-url https://download.pytorch.org/whl/cpu ```\n\nIf you use the `--extra-index-url` method, you might need to pass it every time you run a `uv` command.\n\n### Task-1-LitServe-Cat-Dog-dataset\n\n**EC2 - GPU Config**: g4dn.xlarge 16 GB RAM T4 - cuda 12.4\n\n**Server**\n\n```python 
src/server_baseline.py```\n\n**Client**\n\n```python tests/benchmark_base.py```\n\n**Experiment Explanation**:\n\nIn this series of experiments, server and client configurations were optimized incrementally to improve throughput and reduce bottlenecks in deploying a Cat-Dog/Dog-Breed classifier with LitServe. **Experiment 1** served as a baseline with no batching or worker configurations, yielding suboptimal GPU and CPU utilization due to lack of concurrency. **Experiment 2** introduced batch processing, slightly improving throughput as the server began to handle requests more efficiently by aggregating them. Adding workers (**Experiment 3**) significantly boosted performance by parallelizing request processing, leveraging multi-core CPU resources. Transitioning to float16 precision (**Experiment 4**) further optimized GPU utilization and throughput by reducing computational overhead, though with some trade-offs in single-threaded performance. Tuning batch timeout (**Experiment 5**) and max batch size (**Experiment 6**) refined batching behavior, leading to a balance between throughput and latency. Overall, the incremental optimizations showcased progressive utilization of hardware capabilities, with GPU and CPU reaching near-maximum efficiencies at higher concurrency levels and tuned configurations.\n\n**Theoretical Maximum Throughput**\n\nThe theoretical maximum throughput represents the upper bound of requests per second that a server can process under ideal conditions. 
It is determined by:\n\n- Hardware limitations (GPU computation capacity, CPU, memory bandwidth).\n- Precision (lower precision like float16 reduces computational load, allowing more inferences).\n- Concurrency and batching efficiency (more concurrent threads/workers leverage the hardware optimally).\n\n*To calculate the theoretical maximum throughput:*\n\n1. Estimate the time for a single inference at peak GPU usage (e.g., based on the maximum GPU utilization observed in the benchmarks).\n2. Divide 1 by the inference time to get the per-second throughput.\n3. Multiply by the batch size and the number of workers to factor in parallel processing.\n\nFor these experiments, **maximum GPU usage (82.7%) and batch size 256** **with float16** suggest near-optimal GPU utilization. The server may approach ~300 reqs/sec under perfect conditions, considering diminishing returns beyond these optimizations.\n\n### Theoretical Calculation\n\n- T4 GPU TFLOPS - Float 32 - 8.1\n\n- T4 GPU TFLOPS - Float 16 - 65\n\nGet FLOPs of your model - `python tests/test_flops.py`\n\n**Float 32 - Theoretical throughput**\n\n- Custom model FLOPs = 4.45×10^9 = 4.45 GFLOPs\n\n- T4 GPU = 8.1 * 10^12 FLOPS = 8100 GFLOPS or 8.1 TFLOPS\n\n- Theoretical time (in seconds) = (FLOPs of model) / (GPU FLOPS)\n\n- Theoretical time = 4.45 / 8100 = 0.000549 seconds = 549 microseconds\n\n*Theoretical throughput*\n\n- Theoretical throughput (requests/second) = (GPU TFLOPS * 10^12) / (model FLOPs)\n\n- Inferences per second = 1 / 549 microseconds ≈ 1820 requests per second\n\nWith a batch size of 64 we reached only `160.85 reqs/sec` in API serving, which is nowhere near the theoretical figure and is still below the baseline model throughput.\n\n**Float 16**\n\n- Custom model FLOPs = 4.45×10^9 = 4.45 GFLOPs\n\n- T4 GPU = 65 * 10^12 FLOPS = 65000 GFLOPS or 65 TFLOPS\n\n- Theoretical time (in seconds) = (FLOPs of model) / (GPU FLOPS)\n\n- Theoretical time = 4.45 / 65000 ≈ 6.8×10^(-5) seconds\n\n**Theoretical throughput**\n\n- Theoretical 
throughput (requests/second) = (GPU TFLOPS * 10^12) / (model FLOPs)\n\n- Inferences per second = 1 / (6.83×10^(-5)) ≈ 14,640 requests per second\n\nWith a batch size of 256 we reached only `152.51 reqs/sec` in API serving, which is nowhere near the theoretical figure and is still below the baseline model throughput.\n\n**Reference**\n\n- [Test Program](./tests/test_flops.py)\n\n- [T4 GPU GFLOPS](https://www.dell.com/support/kbdoc/en-us/000132094/deep-learning-performance-on-t4-gpus-with-mlperf-benchmarks)\n\n\n### Task-1-Experiment 1\n\n**Server**: server_baseline.py\n\n*precision* : Full (float32) | *max_batch_size* : 4096 | *batch_timeout* : 0.01 | *workers* : 0\n\n**Client**: tests/benchmark_base.py\n\n**Client Hyperparameter Setting**\n\n*batch_sizes* : [1, 8, 32, 64, 128, 256] | **benchmark_api.num_requests** : **128**\n\n*Result plot*\n\n```\nRunning baseline throughput tests...\nBatch size 1: 181.62 reqs/sec\nBatch size 8: 297.16 reqs/sec\nBatch size 32: 295.98 reqs/sec\nBatch size 64: 276.82 reqs/sec\nBatch size 128: 280.78 reqs/sec\nBatch size 256: 280.83 reqs/sec\nRunning API benchmarks...\nConcurrency 1: 37.74 reqs/sec, CPU: 20.6%, GPU: 13.1%\nConcurrency 8: 82.58 reqs/sec, CPU: 43.6%, GPU: 38.2%\nConcurrency 32: 80.53 reqs/sec, CPU: 45.1%, GPU: 31.8%\nConcurrency 64: 78.41 reqs/sec, CPU: 37.7%, GPU: 36.2%\nConcurrency 128: 90.05 reqs/sec, CPU: 49.1%, GPU: 39.2%\nConcurrency 256: 90.58 reqs/sec, CPU: 43.7%, GPU: 38.5%\n```\n\n![](./assets/benchmark_results_baseline_128.png)\n\n**Client Hyperparameter Setting**\n\n*batch_sizes* : [1, 8, 32, 64, 128, 256] | **benchmark_api.num_requests** : **256**\n\n\n*Result plot*\n\n```\nRunning baseline throughput tests...\nBatch size 1: 181.18 reqs/sec\nBatch size 8: 284.97 reqs/sec\nBatch size 32: 293.08 reqs/sec\nBatch size 64: 280.36 reqs/sec\nBatch size 128: 283.41 reqs/sec\nBatch size 256: 280.35 reqs/sec\nRunning API benchmarks...\nConcurrency 1: 82.81 reqs/sec, CPU: 40.7%, GPU: 41.4%\nConcurrency 8: 91.88 reqs/sec, CPU: 47.2%, GPU: 
37.7%\nConcurrency 32: 90.56 reqs/sec, CPU: 48.3%, GPU: 35.9%\nConcurrency 64: 89.32 reqs/sec, CPU: 46.3%, GPU: 38.8%\nConcurrency 128: 86.09 reqs/sec, CPU: 40.9%, GPU: 39.8%\nConcurrency 256: 85.84 reqs/sec, CPU: 38.5%, GPU: 40.2%\n```\n\n![](./assets/benchmark_results_baseline_256.png)\n\n### Task-1-Experiment 2\n\nGoing with *benchmark_api.num_requests* = 256 from here on, as it gives good utilization.\n\n**Server**: server_batch_fullp_w0.py\n\n**Batch processing**\n\n*precision* : Full (float32) | *max_batch_size* : 4096 | *batch_timeout* : 0.01 | *workers* : 0\n\n\n**Client**: tests/benchmark_base.py\n\n**Client Hyperparameter Setting**\n\n*batch_sizes* : [1, 8, 32, 64, 128, 256] | *benchmark_api.num_requests* : 256\n\n```\nRunning baseline throughput tests...\nBatch size 1: 183.52 reqs/sec\nBatch size 8: 289.45 reqs/sec\nBatch size 32: 292.08 reqs/sec\nBatch size 64: 276.93 reqs/sec\nBatch size 128: 280.14 reqs/sec\nBatch size 256: 280.34 reqs/sec\nRunning API benchmarks...\nConcurrency 1: 46.17 reqs/sec, CPU: 24.9%, GPU: 18.9%\nConcurrency 8: 100.87 reqs/sec, CPU: 40.5%, GPU: 25.5%\nConcurrency 32: 114.23 reqs/sec, CPU: 44.6%, GPU: 35.9%\nConcurrency 64: 111.32 reqs/sec, CPU: 49.0%, GPU: 36.3%\nConcurrency 128: 117.67 reqs/sec, CPU: 42.4%, GPU: 53.0%\nConcurrency 256: 124.09 reqs/sec, CPU: 39.3%, GPU: 40.4%\n```\n\n![](./assets/benchmark_results_batchw0_256.png)\n\n\n\n### Task-1-Experiment 3\n\n**Server**: server_batch_fullp.py\n\n*precision* : Full (float32) | *max_batch_size* : 4096 | *batch_timeout* : 0.01  | **workers** : **4**\n\n**Client**: tests/benchmark_base.py\n\n**Client Hyperparameter Setting**\n\n*batch_sizes* : [1, 8, 32, 64, 128, 256] | *benchmark_api.num_requests* : 256\n\n\n```\nRunning baseline throughput tests...\nBatch size 1: 161.91 reqs/sec\nBatch size 8: 291.14 reqs/sec\nBatch size 32: 292.65 reqs/sec\nBatch size 64: 278.49 reqs/sec\nBatch size 128: 281.33 reqs/sec\nBatch size 256: 280.38 reqs/sec\nRunning API 
benchmarks...\nConcurrency 1: 41.32 reqs/sec, CPU: 36.9%, GPU: 20.4%\nConcurrency 8: 132.28 reqs/sec, CPU: 93.0%, GPU: 49.8%\nConcurrency 32: 148.67 reqs/sec, CPU: 99.5%, GPU: 42.4%\nConcurrency 64: 160.85 reqs/sec, CPU: 99.5%, GPU: 60.2%\nConcurrency 128: 131.51 reqs/sec, CPU: 82.2%, GPU: 50.8%\nConcurrency 256: 130.53 reqs/sec, CPU: 71.3%, GPU: 82.7%\n```\n\n![](./assets/benchmark_results_batchw4_256.png)\n\n### Task-1-Experiment 4\n\n**Server**: server_batch_halfp.py\n\n**precision** : **Half (float16)** | *max_batch_size* : 4096 | *batch_timeout* : 0.01 | *workers* : 4\n\n**Client**: tests/benchmark_base.py\n\n**Client Hyperparameter Setting**\n\n*batch_sizes* : [1, 8, 32, 64, 128, 256] | *benchmark_api.num_requests* : 256\n\n\n\n```\nRunning baseline throughput tests...\nBatch size 1: 157.38 reqs/sec\nBatch size 8: 291.67 reqs/sec\nBatch size 32: 292.66 reqs/sec\nBatch size 64: 279.02 reqs/sec\nBatch size 128: 281.80 reqs/sec\nBatch size 256: 281.31 reqs/sec\nRunning API benchmarks...\nConcurrency 1: 43.53 reqs/sec, CPU: 36.7%, GPU: 36.6%\nConcurrency 8: 112.87 reqs/sec, CPU: 83.7%, GPU: 65.9%\nConcurrency 32: 121.17 reqs/sec, CPU: 86.0%, GPU: 67.2%\nConcurrency 64: 136.24 reqs/sec, CPU: 92.5%, GPU: 59.4%\nConcurrency 128: 133.77 reqs/sec, CPU: 100.0%, GPU: 52.8%\nConcurrency 256: 137.70 reqs/sec, CPU: 77.8%, GPU: 81.7%\n```\n\n![](./assets/benchmark_results_batchw4_half_001_2_256.png)\n\n\n### Task-1-Experiment 5\n\n**Server**: server_batch_halfp.py\n\n*precision* : Half (float16) | *max_batch_size* : 4096 | **batch_timeout** : **0.05** | *workers* : 4\n\n**Client**: tests/benchmark_base.py\n\n**Client Hyperparameter Setting**\n\n*batch_sizes* : [1, 8, 32, 64, 128, 256] | *benchmark_api.num_requests* : 256\n\n\n```\nRunning baseline throughput tests...\nBatch size 1: 156.51 reqs/sec\nBatch size 8: 290.05 reqs/sec\nBatch size 32: 291.75 reqs/sec\nBatch size 64: 277.99 reqs/sec\nBatch size 128: 281.86 reqs/sec\nBatch size 256: 280.80 reqs/sec\nRunning API 
benchmarks...\nConcurrency 1: 19.28 reqs/sec, CPU: 22.0%, GPU: 16.0%\nConcurrency 8: 90.43 reqs/sec, CPU: 61.9%, GPU: 45.9%\nConcurrency 32: 150.88 reqs/sec, CPU: 93.1%, GPU: 60.2%\nConcurrency 64: 142.15 reqs/sec, CPU: 93.0%, GPU: 56.6%\nConcurrency 128: 126.65 reqs/sec, CPU: 76.5%, GPU: 68.5%\nConcurrency 256: 152.51 reqs/sec, CPU: 83.2%, GPU: 76.3%\n```\n\n![](./assets/benchmark_results_batchw4_half_005_256.png)\n\n### Task-1-Experiment 6\n\n**Server**: server_batch_halfp.py\n\n*precision* : Half (float16) | **max_batch_size** : **256** | *batch_timeout* : 0.05 | *workers* : 4\n\n**Client**: tests/benchmark_base.py\n\n**Client Hyperparameter Setting**\n\n*batch_sizes* : [1, 8, 32, 64, 128, 256] | *benchmark_api.num_requests* : 256\n\n\n```\nRunning baseline throughput tests...\nBatch size 1: 157.74 reqs/sec\nBatch size 8: 288.52 reqs/sec\nBatch size 32: 293.95 reqs/sec\nBatch size 64: 279.16 reqs/sec\nBatch size 128: 282.15 reqs/sec\nBatch size 256: 281.18 reqs/sec\nRunning API benchmarks...\nConcurrency 1: 19.29 reqs/sec, CPU: 22.2%, GPU: 14.1%\nConcurrency 8: 90.11 reqs/sec, CPU: 62.6%, GPU: 49.3%\nConcurrency 32: 144.32 reqs/sec, CPU: 97.6%, GPU: 50.0%\nConcurrency 64: 130.22 reqs/sec, CPU: 85.8%, GPU: 58.6%\nConcurrency 128: 134.48 reqs/sec, CPU: 91.2%, GPU: 59.0%\nConcurrency 256: 134.46 reqs/sec, CPU: 78.6%, GPU: 58.3%\n```\n\n![](./assets/benchmark_results_batchw4_half_005_maxl_256.png)\n\n\n### Task-2 Deploy any llama based llm with LitServe\n\n**Basic LLM Working of Llama 8B and 1B Instruct models**\n\n- ```python src/sample_test_working.py```\n\n- ```python src/sample_test_llama32_working.py```\n\n**Usage**\n\n### Task-2-Experiment 1\n\n**Model** : `unsloth/Llama-3.2-1B-Instruct`\n\n**Optimization** : PEFT + LoRA - 4-bit quantization\n\n**Server**\n\n- ```python src/server_llm_llama3_2.py```\n\n**Client**\n\n- ```python tests/test_llm_llama_3_2.py```\n\n```\nBenchmakring for unsloth/Llama-3.2-1B-Instruct with max_tokens 250\nRun no 0 - 
model_throughput(tokens/sec) - 15.91032594111205 | theoretical_max - 150 \nRun no 1 - model_throughput(tokens/sec) - 15.918434793146428 | theoretical_max - 150 \nRun no 2 - model_throughput(tokens/sec) - 15.946354350946034 | theoretical_max - 150 \nRun no 3 - model_throughput(tokens/sec) - 15.904872901654354 | theoretical_max - 150 \nRun no 4 - model_throughput(tokens/sec) - 15.948213135458118 | theoretical_max - 150 \n```\n\n### Task-2-Experiment 2\n\n**Model** : `unsloth/Llama-3.2-1B-Instruct`\n\n**Optimization** : PEFT + LoRA - 4-bit quantization\n\n**Server**\n\n- ```python src/server_llm_llama3_2.py```\n\n**Client**\n\n- ```python tests/test_llm_llama_3_2.py```\n\n```\nBenchmakring for unsloth/Llama-3.2-1B-Instruct with max_tokens 500\nRun no 0 - model_throughput(tokens/sec) - 15.875130450471548 | theoretical_max - 150 \nRun no 1 - model_throughput(tokens/sec) - 15.891949508097365 | theoretical_max - 150 \nRun no 2 - model_throughput(tokens/sec) - 15.87916840600827 | theoretical_max - 150 \nRun no 3 - model_throughput(tokens/sec) - 15.884255381263513 | theoretical_max - 150 \nRun no 4 - model_throughput(tokens/sec) - 15.89836775118845 | theoretical_max - 150\n```\n\n### Task-2-Experiment 3\n\n**Model** : `unsloth/Llama-3.2-1B-Instruct-bnb-4bit`\n\n**Optimization** : torch-ao 4-bit quantization, attention, static cache, max-autotune\n\n**Server**\n\n- ```python src/server_llm_llama3_2_torchao.py```\n\n**Client**\n\n- ```python tests/test_llm_llama_3_2.py```\n\n- Change the model name in the client when running; `server_llm_llama3_2_torchao.py`, however, is hardcoded with the correct model name for the torchao 4-bit model\n\n```\nBenchmakring for unsloth/Llama-3.2-1B-Instruct-bnb-4bit with max_tokens 250\nRun no 0 - model_throughput(tokens/sec) - 16.645872485326894 | theoretical_max - 150 \nRun no 1 - model_throughput(tokens/sec) - 24.916799895716238 | theoretical_max - 150 \nRun no 2 - model_throughput(tokens/sec) - 24.889223601626053 | theoretical_max - 150 \nRun no 3 - 
model_throughput(tokens/sec) - 24.810227143607555 | theoretical_max - 150 \nRun no 4 - model_throughput(tokens/sec) - 24.63610144578302 | theoretical_max - 150 \n```\n\n### Task-2-Experiment 4\n\n**Model** : `unsloth/Llama-3.2-1B-Instruct-bnb-4bit`\n\n**Optimization** : torch-ao 4-bit quantization, attention, static cache, max-autotune\n\n**Server**\n\n- ```python src/server_llm_llama3_2_torchao.py```\n\n**Client**\n\n- ```python tests/test_llm_llama_3_2.py```\n\n- Change the model name in the client when running; `server_llm_llama3_2_torchao.py`, however, is hardcoded with the correct model name for the torchao 4-bit model\n\n```\nBenchmakring for unsloth/Llama-3.2-1B-Instruct-bnb-4bit with max_tokens 500\nRun no 0 - model_throughput(tokens/sec) - 24.369102739963743 | theoretical_max - 150 \nRun no 1 - model_throughput(tokens/sec) - 24.559441143452542 | theoretical_max - 150 \nRun no 2 - model_throughput(tokens/sec) - 24.798074006805344 | theoretical_max - 150 \nRun no 3 - model_throughput(tokens/sec) - 24.701174057950034 | theoretical_max - 150 \nRun no 4 - model_throughput(tokens/sec) - 24.473459727389834 | theoretical_max - 150 \n```\n\nWe can observe that after applying quantization, attention, static cache and max-autotune we get about 24 tokens per second, which is a **56.98 %** increase over the PEFT-LoRA runs.\n\n### Theoretical Throughput calculation for LLama-1B model\n\n*Config - g4dn.xlarge - T4 16 GB ram - accelerator memory bandwidth = 300 GB/s*\n\n- time/token = total number of bytes moved (the model weights) / accelerator memory bandwidth\n- time/token = (2 bytes/parameter * 1B parameters) / (300 GB/s) = 2 GB / (300 GB/s) = 6.67 ms/token\n- tokens/second = 1 / (6.67 ms/token) ≈ 150 tokens/second\n\n- Reference\n    - [LLM transformer inference-guide](https://www.baseten.co/blog/llm-transformer-inference-guide/)\n\n### Task-2 Bonus assignment\n\n**Torch-AO**\n\n- **torch-ao** is used for **4-bit quantization**, together with eager-mode attention (flash-attention-2 needs a higher-end GPU), static cache and max-autotune.\n\n**Static Cache 
Implementation**\n- Sets `model.generation_config.cache_implementation = \"static\"` for optimized token caching during generation, reducing redundant computation and improving inference speed.\n\n**4-bit Quantization**\n- Utilizes `BitsAndBytesConfig` for efficient model compression, enabling lower memory usage without significant loss of performance.\n\n**LoRA Configuration**\n\n**Key Parameters**:\n- r=16: Defines the rank for low-rank updates, balancing performance and efficiency.\n- lora_alpha=32: Scaling factor for LoRA updates.\n- lora_dropout=0.05: Introduces slight regularization to prevent overfitting.\n- bias=\"none\": Excludes bias terms from updates for simplicity.\n- task_type=\"CAUSAL_LM\": Configures LoRA for causal language modeling tasks.\n- target_modules: Specifies layers to apply LoRA.\n\n**Benefits**:\n- Memory Efficiency: 4-bit quantization and LoRA reduce model size while maintaining accuracy.\n- Speed Optimization: Static caching accelerates inference by reusing cached tokens.\n- Scalability: LoRA enables efficient fine-tuning for specific tasks without retraining the entire model.\n\n```\npeft_config = LoraConfig(\n        r=16,\n        lora_alpha=32,\n        lora_dropout=0.05,\n        bias=\"none\",\n        task_type=\"CAUSAL_LM\",\n        target_modules=self.modules\n    )\n```\n\n### Learnings\n- Learnt about deploying with LitServe and how batch processing, num_workers and other parameters affect the throughput (requests per second) and GPU utilization efficiency.\n\n- Learnt to find a proper prompt and a proper model for an LLM task. E.g. we need to choose `Llama-3.2-1B-Instruct` instead of `Llama-3.2-1B`, which is a base model that was not fine-tuned for chat completion.\n\n- We need to refer to GitHub code or proper documentation for the prompting or chat template specific to a model such as `Llama-3.2-1B-Instruct`. 
Otherwise we would get irrelevant junk values.\n\n- Use 4-bit models like `Llama-3.2-1B-Instruct-bnb-4bit` while doing torchao 4-bit quantization to avoid errors\n\n### Results\n\n- Deploy the Cat-Dog or Dog-Breed Classifier with LitServe and benchmark the server performance.\n    - Screenshots and benchmark info attached above in Section 1 \n\n- Deploy any llama based llm with LitServe\n    - Theoretical max throughput = 150 tokens per second \n    - In this repo, normal PEFT-LoRA 4-bit with eager attention - 15.87 tokens per second\n    - TorchAO 4-bit - After applying quantization, attention, static cache and max-autotune we get about 24 tokens per second, which is a **56.98 %** increase over the PEFT-LoRA technique.\n\nNote: For the LLM LitServe task I have not used the streaming method. I went with the general non-streaming method.\n\n\n### Group Members\n\n1. Ajith Kumar V (myself)\n2. Pravin Sagar\n3. Pratyush\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fajithvcoder%2Femlo4-session-09-ajithvcoder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fajithvcoder%2Femlo4-session-09-ajithvcoder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fajithvcoder%2Femlo4-session-09-ajithvcoder/lists"}