https://github.com/renswickd/semantic-prompt-cache
This app leverages Semantic Caching to minimize inference latency and reduce API costs by reusing semantically similar prompt responses.
mistral-api optimization rag semantic-caching ttl-cache
- Host: GitHub
- URL: https://github.com/renswickd/semantic-prompt-cache
- Owner: renswickd
- Created: 2025-06-08T10:22:17.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-07-04T10:07:20.000Z (12 months ago)
- Last Synced: 2025-07-04T11:24:09.398Z (12 months ago)
- Topics: mistral-api, optimization, rag, semantic-caching, ttl-cache
- Language: Python
- Homepage:
- Size: 32.2 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# RAG + Semantic Cache System
This project is designed to enhance a Retrieval-Augmented Generation (RAG) pipeline with a custom-built Semantic Cache system. The primary goal is to reduce redundant LLM (Large Language Model) calls, improve system responsiveness, and optimize cost for real-time and large-scale AI applications.
## Purpose
In traditional RAG pipelines, every user query is processed through document retrieval and LLM generation, even if a semantically similar query was already answered. This approach increases latency and inflates API usage costs.
This system introduces a semantic caching layer that intercepts incoming queries and compares them, based on meaning rather than keywords alone, against previously answered queries. If a sufficiently similar query is found, the cached response is reused, bypassing the need for another LLM call.
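A minimal sketch of that comparison step, assuming queries have already been turned into embedding vectors. The names `cosine_similarity` and `lookup`, the `0.85` threshold, and the toy 3-dimensional vectors are illustrative assumptions, not the repository's code; the project itself uses FAISS for the search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def lookup(query_vec, entries, threshold=0.85):
    """Return the cached response closest to query_vec, or None on a miss.

    entries is a list of (embedding, response) pairs (hypothetical layout).
    """
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None

# Toy 3-dimensional "embeddings" standing in for real model output.
entries = [
    ([0.9, 0.1, 0.0], "Paris, Lyon, Nice..."),
    ([0.0, 0.2, 0.9], "Use a similarity threshold."),
]
hit = lookup([0.88, 0.15, 0.02], entries)   # close to the first entry
miss = lookup([0.5, 0.5, 0.5], entries)     # not close enough to either
```

Raising the threshold trades fewer wrong reuses for more LLM calls; a suitable value likely depends on the embedding model and the domain.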
## Use Cases
- **Chatbots with memory efficiency**
Minimize repeated LLM calls for frequently asked or rephrased questions.
- **Enterprise knowledge assistants**
Provide consistent and faster answers to similar user queries across departments.
- **High-throughput RAG pipelines**
Scale to thousands of queries per day while maintaining performance and reducing cost.
- **Latency-sensitive applications**
Reduce end-user wait time by short-circuiting the full RAG flow when a cached response is available.
# Semantic Cache for LLM-Enhanced RAG
A modular, non-OOP semantic caching system built to reduce LLM calls and latency in Retrieval-Augmented Generation (RAG) pipelines.
## Features
- ✅ Embeds user queries using `bge-small-en-v1.5`
- ✅ Stores query-response pairs in a FAISS index
- ✅ Retrieves cached results based on semantic similarity
- ✅ Configurable similarity threshold
- ✅ Supports metadata (timestamps, hits) and leaderboard extensions
- ✅ Fully functional with Mistral (via Groq) or any OpenRouter-compatible LLM
- ✅ Enterprise knowledge assistants (e.g. Azure Docs)
- ✅ High-throughput RAG pipelines
- ✅ Latency-sensitive LLM apps
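The metadata and leaderboard items could look roughly like the sketch below. It keeps the function-based, non-OOP style described above; the field names (`created_at`, `hits`) and the helpers `make_entry`, `record_hit`, and `leaderboard` are assumptions for illustration, not the repository's actual schema.

```python
import time

def make_entry(query, response, now=None):
    """Build a cache record carrying a timestamp and a hit counter."""
    return {
        "query": query,
        "response": response,
        "created_at": time.time() if now is None else now,
        "hits": 0,
    }

def record_hit(entry):
    """Increment the hit counter each time the entry is served."""
    entry["hits"] += 1
    return entry["response"]

def leaderboard(entries, top_n=3):
    """Return the most frequently served queries, highest hit count first."""
    ranked = sorted(entries, key=lambda e: e["hits"], reverse=True)
    return [(e["query"], e["hits"]) for e in ranked[:top_n]]

entries = [
    make_entry("top places to visit in France", "Paris, Lyon, Nice..."),
    make_entry("what is a semantic cache", "A cache keyed on meaning."),
]
record_hit(entries[1])
record_hit(entries[1])
record_hit(entries[0])
print(leaderboard(entries))
```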
# Architecture Overview
```text
    ┌──────────────────────────────┐
    │       User Query Input       │
    └──────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────┐
│    1. Check Semantic Cache (FAISS)    │
└───────────────────────────────────────┘
        │ Yes (high match)    │ No (miss)
        ▼                     ▼
Reuse Cached LLM   ┌─────────────────────┐
    Response       │ 2. Retrieve Context │
                   └─────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │ 3. Build Prompt + Inject Docs │
              └───────────────────────────────┘
                              │
                              ▼
            ┌────────────────────────────────────┐
            │ 4. Generate Response (Mistral LLM) │
            └────────────────────────────────────┘
                              │
                              ▼
            ┌────────────────────────────────────┐
            │  5. Postprocess + Store in Cache   │
            └────────────────────────────────────┘
```
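The five steps in the diagram can be sketched as a single function. This is an illustration under stated assumptions: `answer` and all of its injected callables are hypothetical names, the postprocessing step is reduced to `strip()`, and the stand-ins below use an exact-match dict instead of FAISS so the sketch runs without any external service.

```python
def answer(query, cache_get, cache_set, retrieve, build_prompt, generate):
    """Run the five steps from the diagram; every callable is injected."""
    cached = cache_get(query)                  # 1. check semantic cache
    if cached is not None:
        return cached, True                    # cache hit: skip the RAG flow
    docs = retrieve(query)                     # 2. retrieve context
    prompt = build_prompt(query, docs)         # 3. build prompt + inject docs
    response = generate(prompt)                # 4. generate response
    response = response.strip()                # 5. postprocess ...
    cache_set(query, response)                 #    ... and store in cache
    return response, False

# Stand-ins so the sketch runs without FAISS or an LLM (exact-match cache).
store = {}
llm_calls = []

def fake_generate(prompt):
    llm_calls.append(prompt)
    return "  Paris, Lyon, Nice...  "

args = dict(
    cache_get=store.get,
    cache_set=store.__setitem__,
    retrieve=lambda q: ["France travel guide"],
    build_prompt=lambda q, docs: f"Context: {docs}\nQuestion: {q}",
    generate=fake_generate,
)
first = answer("top places to visit in France", **args)
second = answer("top places to visit in France", **args)
```

The second call returns from step 1, so the stand-in LLM is invoked only once.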
## Key Modules
| Module | Purpose |
|--------|---------|
| `semantic_cache/embedder.py` | Loads BGE model and returns query embeddings |
| `semantic_cache/index_manager.py` | Manages FAISS index creation, loading, saving |
| `semantic_cache/operations.py` | Handles get/set/clear cache operations |
| `rag/retriever.py` | Top-k document retrieval from Azure knowledge base |
| `rag/prompt_builder.py` | Combines retrieved chunks + user question into LLM prompt |
| `rag/llm_client.py` | Calls Mistral via Groq using LangChain |
| `rag/ingest_docs.py` | Preprocesses and uploads local docs into FAISS vectorstore |
| `tests/` | Unit tests for all core functionality |
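As an illustration of the prompt-building role listed in the table, the sketch below joins retrieved chunks and the user question into one prompt string. The function signature and the prompt wording are assumptions; the actual `rag/prompt_builder.py` may differ.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user question into one prompt string."""
    # Number each chunk so the model (and a reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "What is Azure Blob Storage?",
    ["Azure Blob Storage is an object storage service. ", "It stores unstructured data."],
)
print(prompt)
```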
## Usage (Example)
```python
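from semantic_cache.operations import get_from_cache, set_in_cache

query = "top places to visit in France"

# Look for a semantically similar query that has already been answered.
cached = get_from_cache(query)
if cached:
    print("✅ Cache Hit:", cached)
else:
    # Cache miss: in a real pipeline this string would come from the RAG + LLM call.
    response = "Paris, Lyon, Nice..."
    set_in_cache(query, response)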
```
## Run Tests
```bash
pytest tests/
```
---
## Next Steps
- Add leaderboard and TTL/size-based cache trimming
- Ingest Azure PDF documentation automatically
- Wrap with FastAPI for API serving
- Upgrade from FAISS to Qdrant/Chroma
- Migrate from Groq to AI Foundry (multi-LLM orchestration)
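
One possible shape for the planned TTL/size-based trimming is sketched below. It is not implemented in the repository; `trim_cache`, its parameters, and the `created_at` field are hypothetical. A FAISS-backed cache would also need to drop the matching vectors from the index, which this sketch leaves out.

```python
import time

def trim_cache(entries, ttl_seconds, max_size, now=None):
    """Drop expired entries, then keep only the newest max_size entries."""
    now = time.time() if now is None else now
    # TTL pass: keep entries whose age does not exceed ttl_seconds.
    fresh = [e for e in entries if now - e["created_at"] <= ttl_seconds]
    # Size pass: newest first, then cut to the size limit.
    fresh.sort(key=lambda e: e["created_at"], reverse=True)
    return fresh[:max_size]

entries = [
    {"query": "a", "created_at": 0},
    {"query": "b", "created_at": 50},
    {"query": "c", "created_at": 90},
    {"query": "d", "created_at": 95},
]
kept = trim_cache(entries, ttl_seconds=60, max_size=2, now=100)
print([e["query"] for e in kept])
```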