# 🚀 [Efficiently Serving Large Language Models](https://www.deeplearning.ai/short-courses/efficiently-serving-llms/)

💻 Welcome to the "Efficiently Serving Large Language Models" course! Taught by Travis Addair, Co-Founder and CTO of Predibase, this course deepens your understanding of how to serve LLM applications efficiently.



## Course Summary
In this course, you'll delve into the optimization techniques necessary to efficiently serve Large Language Models (LLMs) to a large number of users. Here's what you can expect to learn and experience:

1. 🤖 **Auto-Regressive Models**: Understand how auto-regressive large language models generate text token by token (see the decoding-loop sketch after this list).

2. 💻 **LLM Inference Stack**: Implement foundational elements of a modern LLM inference stack, including KV caching, continuous batching, and model quantization (a KV-cache sketch follows below).

3. 🛠️ **LoRA Adapters**: Explore how Low Rank Adapters (LoRA) work and how batching techniques allow different LoRA adapters to be served to multiple customers simultaneously (sketched below).

4. 🚀 **Hands-On Experience**: Get hands-on with Predibase’s LoRAX framework inference server to see these optimization techniques in action (a client example follows).
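
Below are a few minimal sketches of these ideas. First, the auto-regressive decoding loop from point 1: the model predicts one next token, appends it to the input, and repeats. This sketch uses Hugging Face Transformers with GPT-2 as a small stand-in model (the course may use a different model and tooling):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                          # generate 20 tokens, one at a time
        logits = model(input_ids).logits         # shape: (batch, seq_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # append and repeat

print(tokenizer.decode(input_ids[0]))
```

Note that this loop re-processes the entire sequence on every step, which is exactly the cost that KV caching removes.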
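
With KV caching (point 2), the attention keys and values computed for earlier tokens are stored and reused, so each step only runs the model on the newly generated token. A sketch of the same loop using the `past_key_values` cache in Transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
generated = input_ids
past_key_values = None                        # the KV cache, filled on the first pass

with torch.no_grad():
    for _ in range(20):
        out = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values             # reuse cached keys/values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        input_ids = next_token                # only the new token is fed on the next step

print(tokenizer.decode(generated[0]))
```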
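
For point 3, LoRA keeps the pretrained weight frozen and learns a small low-rank update `(alpha / r) * B @ A`, so many adapters can be kept in memory and swapped or batched per request. A toy sketch of a LoRA linear layer (an illustration, not the course's code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)        # pretrained weight stays frozen
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # zero init: no-op at start
        self.scaling = alpha / rank

    def forward(self, x):
        low_rank = (x @ self.lora_A.T) @ self.lora_B.T  # rank-r detour around the base weight
        return self.base(x) + self.scaling * low_rank

layer = LoRALinear(512, 512)
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

Because only `lora_A` and `lora_B` differ between customers, a server can run the shared base layer once per batch and add each row's adapter output on top, which is the batching trick LoRAX builds on.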
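
Finally, for point 4, here is a sketch of querying a running LoRAX server with the `lorax-client` Python package (`pip install lorax-client`). The endpoint URL and `adapter_id` below are placeholders; check the LoRAX docs for the exact parameters:

```python
from lorax import Client

client = Client("http://127.0.0.1:8080")  # assumes a LoRAX server running locally

# The same base model can serve a different LoRA adapter per request:
response = client.generate(
    "Classify this support ticket: my order never arrived.",
    adapter_id="my-org/ticket-classifier-lora",  # hypothetical adapter name
    max_new_tokens=64,
)
print(response.generated_text)
```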

## Key Points
- 🔎 Learn techniques like KV caching to speed up text generation in Large Language Models (LLMs).
- 💻 Write code to efficiently serve LLM applications to a large number of users while considering performance trade-offs.
- 🛠️ Explore the fundamentals of Low Rank Adapters (LoRA) and how Predibase implements them in the LoRAX framework inference server.

## About the Instructor
🌟 **Travis Addair** is the Co-Founder and CTO of Predibase and brings extensive expertise to guide you through serving Large Language Models (LLMs) efficiently.

🔗 To enroll in the course or for further information, visit [deeplearning.ai](https://www.deeplearning.ai/short-courses/).