awesome-compound-ai-systems
Papers about infrastructure (deployment & serving) and systems for compound AI
https://github.com/outerport/awesome-compound-ai-systems
Papers
Caching
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
- Do Large Language Models Need a Content Delivery Network?
- Compute Or Load KV Cache? Why Not Both?
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
- TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text
- RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
- LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management
- Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching
- InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
- MoE-Infinity: Offloading-Efficient MoE Model Serving
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
- ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
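
A common thread in these papers is reusing the KV cache computed during prefill across requests that share a prompt prefix. Below is a minimal sketch of that idea, not the API of any system above: `Model.prefill` and the exact-match hash store are assumptions for illustration, and real systems (e.g., PagedAttention-style block tables, Preble's radix-tree matching) are considerably more refined.

```python
# Minimal sketch of prefix-based KV-cache reuse, the common idea behind
# many of the papers above. `model.prefill` and the cache granularity are
# illustrative assumptions, not the interface of any specific system.
import hashlib

class KVCacheStore:
    """Maps a hash of previously seen prompts to their computed KV state."""

    def __init__(self):
        self._store = {}  # prompt hash -> KV tensors

    def _key(self, tokens):
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def lookup(self, tokens):
        """Find the longest previously seen prompt that prefixes `tokens`."""
        for end in range(len(tokens), 0, -1):
            hit = self._store.get(self._key(tokens[:end]))
            if hit is not None:
                return end, hit
        return 0, None

    def insert(self, tokens, kv):
        self._store[self._key(tokens)] = kv


def prefill_with_reuse(model, store, tokens):
    """Prefill only the uncached suffix; reuse KV state for the prefix."""
    cached_len, kv = store.lookup(tokens)
    if cached_len < len(tokens):
        # Hypothetical model API: extend KV state over the new tokens only.
        kv = model.prefill(tokens[cached_len:], past_kv=kv)
        store.insert(tokens, kv)
    return kv
```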
Communication
- Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design
- Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems
- Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
- MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
- Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution
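
Systems like Parrot argue that exposing the dataflow between LLM calls, rather than submitting opaque, sequential string requests, lets the serving layer pipeline and parallelize them. A minimal sketch of that idea using plain Python futures; `llm_call` is a placeholder, not Parrot's actual interface:

```python
# Illustrative sketch of dataflow between LLM calls: dependencies are
# expressed as futures, so independent calls run concurrently and the
# final call waits only on the variables it consumes.
from concurrent.futures import ThreadPoolExecutor

def llm_call(prompt: str) -> str:
    # Placeholder for a request to an LLM serving backend.
    return f"<completion of: {prompt[:40]}...>"

with ThreadPoolExecutor() as pool:
    # Two independent calls can run concurrently...
    outline = pool.submit(llm_call, "Outline a post on KV caching.")
    examples = pool.submit(llm_call, "List three KV-cache reuse scenarios.")
    # ...while the dependent call blocks only on its inputs.
    draft = pool.submit(
        lambda: llm_call(
            f"Expand this outline:\n{outline.result()}\n"
            f"Use these examples:\n{examples.result()}"
        )
    )
    print(draft.result())
```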
Scheduling and Orchestration
- Punica: Multi-Tenant LoRA Serving
- SLoRA: Scalable Serving of Thousands of LoRA Adapters
- dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
- Llumnix: Dynamic Scheduling for Large Language Model Serving
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving
- TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
- ALTO: An Efficient Network Orchestrator for Compound AI Systems
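
Punica, SLoRA, and dLoRA all build on the observation that LoRA adapters are small low-rank deltas over a shared base weight, so requests targeting many different adapters can be batched through a single base matmul. A NumPy sketch of that batching idea, with illustrative shapes; real systems fuse the per-adapter delta into custom CUDA kernels:

```python
# Minimal NumPy sketch of batched multi-LoRA serving: one shared base
# matmul for the whole batch, plus a small per-request low-rank
# correction x @ A_i @ B_i. Shapes and the gather loop are illustrative.
import numpy as np

d_in, d_out, rank, n_adapters = 64, 64, 8, 4
W = np.random.randn(d_in, d_out) * 0.02              # shared base weight
A = np.random.randn(n_adapters, d_in, rank) * 0.02   # per-adapter LoRA A
B = np.random.randn(n_adapters, rank, d_out) * 0.02  # per-adapter LoRA B

def lora_batch_forward(x, adapter_ids):
    """x: (batch, d_in); adapter_ids: (batch,) -> (batch, d_out)."""
    base = x @ W  # one matmul shared by every request in the batch
    # Per-request low-rank delta; a fused kernel in real systems.
    delta = np.stack([x[i] @ A[a] @ B[a] for i, a in enumerate(adapter_ids)])
    return base + delta

batch = np.random.randn(3, d_in)
print(lora_batch_forward(batch, adapter_ids=[0, 2, 2]).shape)  # (3, 64)
```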
Compression
- Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference
- MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
- Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
- Interfering with Interference: Blind Shuffling and Superposition for Better Multi-Model Compression
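
The recurring primitive here is lossy compression of cached KV tensors so they are cheaper to store and move. A toy int8 round trip to show the basic quantize/dequantize step; CacheGen's streaming codec and MatryoshkaKV's trainable projections are far more sophisticated than this:

```python
# Toy symmetric per-row int8 quantization of a KV tensor, illustrating
# the quantize/dequantize round trip that lossy KV compression builds on.
import numpy as np

def quantize_int8(kv: np.ndarray):
    """kv: (tokens, dim) float32 -> (int8 codes, per-row scales)."""
    scale = np.abs(kv).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.randn(16, 128).astype(np.float32)  # 16 tokens, dim 128
q, scale = quantize_int8(kv)
err = np.abs(kv - dequantize_int8(q, scale)).max()
print(f"~4x smaller, max abs error {err:.4f}")    # fp32 -> int8 is ~4x
```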
Case Studies
- Improving Planning with Large Language Models: A Modular Agentic Architecture
- Self-Evolving Multi-Agent Collaboration Networks for Software Development
- GraphTeam: Facilitating Large Language Model-based Graph Analysis via Multi-Agent Collaboration
- Agent S: An Open Agentic Framework that Uses Computers Like a Human
- Large Language Models for Software Engineering: A Systematic Literature Review
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
Surveys
Position Papers / Broadly Applicable Papers