# 🚀 [Efficiently Serving Large Language Models](https://www.deeplearning.ai/short-courses/efficiently-serving-llms/)

💻 Welcome to the "Efficiently Serving Large Language Models" course! Taught by Travis Addair, Co-Founder and CTO of Predibase, this course deepens your understanding of how to serve LLM applications efficiently.



## Course Summary
In this course, you'll delve into the optimization techniques necessary to efficiently serve Large Language Models (LLMs) to a large number of users. Here's what you can expect to learn and experience:

1. 🤖 **Auto-Regressive Models**: Understand how auto-regressive large language models generate text token by token (see the decoding-loop sketch after this list).

2. 💻 **LLM Inference Stack**: Implement foundational elements of a modern LLM inference stack, including KV caching, continuous batching, and model quantization (a KV-cache sketch follows below).

3. 🛠️ **LoRA Adapters**: Explore how Low Rank Adapters (LoRA) work and how batching techniques allow different LoRA adapters to be served to multiple customers simultaneously (sketched below).

4. 🚀 **Hands-On Experience**: Get hands-on with Predibase’s LoRAX framework inference server to see these optimization techniques in action (a client example follows).
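
Below are a few minimal sketches of these ideas. First, the auto-regressive decoding loop from point 1: the model predicts one next token, appends it to the input, and repeats. This sketch uses Hugging Face Transformers with GPT-2 as a small stand-in model (the course may use a different model and tooling):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                          # generate 20 tokens, one at a time
        logits = model(input_ids).logits         # shape: (batch, seq_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # append and repeat

print(tokenizer.decode(input_ids[0]))
```

Note that this loop re-processes the entire sequence on every step, which is exactly the cost that KV caching removes.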
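
With KV caching (point 2), the attention keys and values computed for earlier tokens are stored and reused, so each step only runs the model on the newly generated token. A sketch of the same loop using the `past_key_values` cache in Transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
generated = input_ids
past_key_values = None                        # the KV cache, filled on the first pass

with torch.no_grad():
    for _ in range(20):
        out = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values             # reuse cached keys/values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        input_ids = next_token                # only the new token is fed on the next step

print(tokenizer.decode(generated[0]))
```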
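
For point 3, LoRA keeps the pretrained weight frozen and learns a small low-rank update `(alpha / r) * B @ A`, so many adapters can be kept in memory and swapped or batched per request. A toy sketch of a LoRA linear layer (an illustration, not the course's code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)        # pretrained weight stays frozen
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # zero init: no-op at start
        self.scaling = alpha / rank

    def forward(self, x):
        low_rank = (x @ self.lora_A.T) @ self.lora_B.T  # rank-r detour around the base weight
        return self.base(x) + self.scaling * low_rank

layer = LoRALinear(512, 512)
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

Because only `lora_A` and `lora_B` differ between customers, a server can run the shared base layer once per batch and add each row's adapter output on top, which is the batching trick LoRAX builds on.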
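
Finally, for point 4, here is a sketch of querying a running LoRAX server with the `lorax-client` Python package (`pip install lorax-client`). The endpoint URL and `adapter_id` below are placeholders; check the LoRAX docs for the exact parameters:

```python
from lorax import Client

client = Client("http://127.0.0.1:8080")  # assumes a LoRAX server running locally

# The same base model can serve a different LoRA adapter per request:
response = client.generate(
    "Classify this support ticket: my order never arrived.",
    adapter_id="my-org/ticket-classifier-lora",  # hypothetical adapter name
    max_new_tokens=64,
)
print(response.generated_text)
```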

## Key Points
- 🔎 Learn techniques like KV caching to speed up text generation in Large Language Models (LLMs).
- 💻 Write code to efficiently serve LLM applications to a large number of users while considering performance trade-offs.
- 🛠️ Explore the fundamentals of Low Rank Adapters (LoRA) and how Predibase implements them in the LoRAX framework inference server.

## About the Instructor
🌟 **Travis Addair** is the Co-Founder and CTO of Predibase and brings extensive expertise to guide you through serving Large Language Models (LLMs) efficiently.

🔗 To enroll in the course or for further information, visit [deeplearning.ai](https://www.deeplearning.ai/short-courses/).