{"id":26758236,"url":"https://github.com/ksm26/efficiently-serving-llms","last_synced_at":"2025-08-02T14:06:23.471Z","repository":{"id":230222151,"uuid":"778396699","full_name":"ksm26/Efficiently-Serving-LLMs","owner":"ksm26","description":"Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low Rank Adapters (LoRA), and gain hands-on experience with Predibase’s LoRAX framework inference server.","archived":false,"fork":false,"pushed_at":"2024-04-12T14:31:00.000Z","size":2449,"stargazers_count":15,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-16T08:57:43.716Z","etag":null,"topics":["batch-processing","deep-learning-techniques","inference-optimization","large-scale-deployment","machine-learning-operations","model-acceleration","model-inference-service","model-serving","optimization-techniques","performance-enhancement","scalability-strategies","server-optimization","serving-infrastructure","text-generation"],"latest_commit_sha":null,"homepage":"https://www.deeplearning.ai/short-courses/efficiently-serving-llms/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ksm26.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-03-27T16:36:49.000Z","updated_at":"2025-06-09T07:11:01.000Z","dependencies_parsed_at":"2024-03-28T14:04:44.100Z","dependency_job_id":"f47923f2-6a10-4833-8ada-fb4e9c9249c6","html_url":"https://github.com/ksm26/Efficiently-Serving-LLMs","commit_stats":null,"previous_names":["ksm26/efficiently-serving-llms"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ksm26/Efficiently-Serving-LLMs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksm26%2FEfficiently-Serving-LLMs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksm26%2FEfficiently-Serving-LLMs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksm26%2FEfficiently-Serving-LLMs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksm26%2FEfficiently-Serving-LLMs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ksm26","download_url":"https://codeload.github.com/ksm26/Efficiently-Serving-LLMs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksm26%2FEfficiently-Serving-LLMs/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268401594,"owners_count":24244464,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-02T02:00:12.353Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["batch-processing","deep-learning-techniques","inference-optimization","large-scale-deployment","machine-learning-operations","model-acceleration","model-inference-service","model-serving","optimization-techniques","performance-enhancement","scalability-strategies","server-optimization","serving-infrastructure","text-generation"],"created_at":"2025-03-28T16:18:58.952Z","updated_at":"2025-08-02T14:06:23.420Z","avatar_url":"https://github.com/ksm26.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🚀 [Efficiently Serving Large Language Models](https://www.deeplearning.ai/short-courses/efficiently-serving-llms/)\n\n💻 Welcome to the \"Efficiently Serving Large Language Models\" course! Instructed by Travis Addair, Co-Founder and CTO at Predibase, this course will deepen your understanding of serving LLM applications efficiently.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/1_1.png\" height=\"350\"\u003e \n\u003c/p\u003e\n\n## Course Summary\nIn this course, you'll delve into the optimization techniques necessary to efficiently serve Large Language Models (LLMs) to a large number of users. Here's what you can expect to learn and experience:\n\n1. 🤖 **Auto-Regressive Models**: Understand how auto-regressive large language models generate text token by token.\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/1_2.png\" height=\"350\"\u003e \n\u003cimg src=\"images/1_3.png\" height=\"350\"\u003e \n\u003cimg src=\"images/1_4.png\" height=\"350\"\u003e \n\u003c/p\u003e\n\n2. 💻 **LLM Inference Stack**: Implement foundational elements of a modern LLM inference stack, including KV caching, continuous batching, and model quantization.\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/2_1.png\" height=\"300\"\u003e \n\u003cimg src=\"images/2_2.png\" height=\"300\"\u003e \n\u003cimg src=\"images/3_3.png\" height=\"300\"\u003e \n\u003c/p\u003e\n\n3. 🛠️ **LoRA Adapters**: Explore the details of how Low Rank Adapters (LoRA) work and how batching techniques allow different LoRA adapters to be served to multiple customers simultaneously.\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/5_1.png\" height=\"300\"\u003e \n\u003cimg src=\"images/5_2.png\" height=\"300\"\u003e \n\u003c/p\u003e\n\n4. 🚀 **Hands-On Experience**: Get hands-on with Predibase’s LoRAX framework inference server to see optimization techniques in action.\n\n## Key Points\n- 🔎 Learn techniques like KV caching to speed up text generation in Large Language Models (LLMs).\n- 💻 Write code to efficiently serve LLM applications to a large number of users while considering performance trade-offs.\n- 🛠️ Explore the fundamentals of Low Rank Adapters (LoRA) and how Predibase implements them in the LoRAX framework inference server.\n\n## About the Instructor\n🌟 **Travis Addair** is the Co-Founder and CTO at Predibase, bringing extensive expertise to guide you through efficiently serving Large Language Models (LLMs).\n\n🔗 To enroll in the course or for further information, visit [deeplearning.ai](https://www.deeplearning.ai/short-courses/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fksm26%2Fefficiently-serving-llms","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fksm26%2Fefficiently-serving-llms","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fksm26%2Fefficiently-serving-llms/lists"}