https://github.com/ronantakizawa/sleeptimecompute
A Demo of Running Sleep-time Compute to Reduce LLM Latency
https://github.com/ronantakizawa/sleeptimecompute
ai llm llm-optimization
Last synced: 3 months ago
JSON representation
A Demo of Running Sleep-time Compute to Reduce LLM Latency
- Host: GitHub
- URL: https://github.com/ronantakizawa/sleeptimecompute
- Owner: ronantakizawa
- Created: 2025-05-15T12:22:58.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-17T04:07:29.000Z (about 1 year ago)
- Last Synced: 2025-06-12T04:07:46.488Z (12 months ago)
- Topics: ai, llm, llm-optimization
- Language: Jupyter Notebook
- Homepage: https://medium.com/@ronantech/demo-how-to-run-sleep-time-compute-to-reduce-llm-latency-84c5626d0770
- Size: 396 KB
- Stars: 16
- Watchers: 2
- Forks: 2
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# A Demo of Sleep-Time Compute to Reduce LLM Latency

This repository demonstrates the "Sleep-time Compute" technique described in the paper ["Sleep-time Compute: Beyond Inference Scaling at Test-time"](https://arxiv.org/pdf/2504.13171) using open-source LLMs.
Google Colab: [https://colab.research.google.com/drive/12Itg_XOCP9sRezztBIIRY97QHli0Lpg8?usp=sharing](https://colab.research.google.com/drive/12MFTFs9YMoOL-znKgYBY5heoEl9qm8AZ?usp=sharing)
Explanation Article: https://medium.com/@ronantech/demo-how-to-run-sleep-time-compute-to-reduce-llm-latency-84c5626d0770
## Overview
Sleep-time Compute is a technique that improves the efficiency and accuracy of language models by splitting computation into two phases:
1. **Sleep-time Phase**: Pre-compute useful inferences about a context when the model would otherwise be idle
2. **Test-time Phase**: Use these pre-computed inferences to answer queries more efficiently
This approach offers several benefits:
- Reduced latency during query time
- Improved accuracy through deeper context understanding
- Cost efficiency through amortization across multiple queries
## When to Use Sleep-Time Compute
Based on the research findings, Sleep-time Compute is most effective in the following scenarios:
- **Stateful Applications**: Systems where context persists across multiple interactions, such as:
- Document question-answering
- Coding assistants operating on shared repositories
- Conversational agents maintaining dialogue history
- **Predictable Queries**: Contexts where potential questions follow predictable patterns
- Research shows the performance gap widens with more predictable queries
- Less effective when queries are difficult to predict or unrelated to the context
- **Multiple Related Queries**: When users ask several questions about the same context
- Cost efficiency improves as the number of queries increases
- Research demonstrates a 2.5× decrease in average cost per query with 10 queries per context
- **High-Latency Constraints**: Applications where reducing test-time compute is critical
- Particularly valuable when test-time tokens are significantly more expensive
- Can reduce test-time compute needed for the same accuracy by ~5×
## When Not to Use Sleep-Time Compute
Sleep-time Compute may not be beneficial in these scenarios:
- **Unpredictable Queries**: When questions are difficult to anticipate from the context
- The research shows diminishing returns for less predictable queries
- Standard test-time compute may be more effective in these cases
- **Single Query Scenarios**: With only one question per context
- The overhead of sleep-time compute isn't amortized
- Cost efficiency significantly drops without multiple related queries
- **High Test-Time Budget Settings**: In applications where extensive test-time compute is already allocated
- Research shows standard test-time compute can sometimes outperform sleep-time compute when sufficient test-time resources are available
- **Non-Stateful Applications**: Systems where context doesn't persist between interactions
- Without a persistent context to analyze during idle time, the core benefit is lost
- **Rapidly Changing Contexts**: Environments where the context is frequently updated
- Pre-computed inferences may quickly become outdated
## Implementation Details
This demo implements Sleep-time Compute using:
- **Mistral-7B-Instruct-v0.1**: A powerful open-source language model
- **Hugging Face Transformers**: For model loading and inference
- **Custom prompting strategies**: Specially designed for the Mistral instruction format
The code demonstrates:
- Setting up the model with appropriate configurations
- Implementing the sleep-time and test-time phases
- Visualizing the benefits through token usage and accuracy metrics
- Multi-query amortization to show efficiency gains
## Key Features
- **Two-Phase Approach**: Clear separation between sleep-time and test-time computation
- **Variable Verbosity**: Control the level of detail in responses
- **Performance Comparison**: Analysis of regular vs. sleep-time compute approaches
- **Visualization**: Graphs showing efficiency gains and amortization benefits
## Results
The implementation demonstrates:
1. **Test-time Efficiency**: Significant reduction in tokens needed at query time
2. **Accuracy Improvements**: More reliable answers through pre-computed inferences
3. **Cost Amortization**: Greater efficiency as the number of queries increases
