https://github.com/renswickd/semantic-prompt-cache

This app leverages Semantic Caching to minimize inference latency and reduce API costs by reusing semantically similar prompt responses.

# RAG + Semantic Cache System

This project is designed to enhance a Retrieval-Augmented Generation (RAG) pipeline with a custom-built Semantic Cache system. The primary goal is to reduce redundant LLM (Large Language Model) calls, improve system responsiveness, and optimize cost for real-time and large-scale AI applications.

## 🚀 Purpose

In traditional RAG pipelines, every user query goes through document retrieval and LLM generation, even if a semantically similar query has already been answered. This increases latency and inflates API usage costs.

This system introduces a semantic caching layer that intercepts incoming queries and compares them, by meaning rather than keywords, against previously answered queries. If a sufficiently similar query is found, the cached response is reused and no further LLM call is made.
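
The lookup described above can be pictured as a nearest-neighbour search over stored query embeddings with a similarity cutoff. The snippet below is a minimal sketch, not this project's implementation: the `lookup` function, the entry layout, the hand-written 3-dimensional vectors, and the 0.9 threshold are all assumptions chosen for readability. The real system embeds queries with a model and searches a FAISS index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def lookup(query_vec, entries, threshold=0.9):
    """Return the response of the most similar entry, or None on a miss.

    `entries` is a list of (embedding, response) pairs (hypothetical layout).
    """
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    # Only reuse the answer when the best match clears the threshold.
    return best_response if best_score >= threshold else None

# Toy embeddings standing in for real model output (assumed values).
entries = [
    ([0.9, 0.1, 0.0], "Paris, Lyon, Nice..."),
    ([0.0, 0.2, 0.9], "Use pytest to run the suite."),
]
hit = lookup([0.88, 0.15, 0.02], entries)  # very close to the first entry
miss = lookup([0.5, 0.5, 0.5], entries)    # not close enough to either entry
```

Raising the threshold makes the cache more conservative (fewer hits, less risk of returning an answer to a different question); lowering it does the opposite.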

## 🔧 Use Cases

- **Chatbots with memory efficiency**
  Minimize repeated LLM calls for frequently asked or rephrased questions.

- **Enterprise knowledge assistants**
  Provide consistent and faster answers to similar user queries across departments.

- **High-throughput RAG pipelines**
  Scale to thousands of queries per day while maintaining performance and reducing cost.

- **Latency-sensitive applications**
  Reduce end-user wait time by short-circuiting the full RAG flow when a cached response is available.

# Semantic Cache for LLM-Enhanced RAG

A modular, non-OOP semantic caching system built to reduce LLM calls and latency in Retrieval-Augmented Generation (RAG) pipelines.

## 🔧 Features

- ✅ Embeds user queries using `bge-small-en-v1.5`
- ✅ Stores query-response pairs in a FAISS index
- ✅ Retrieves cached results based on semantic similarity
- ✅ Configurable similarity threshold
- ✅ Supports metadata (timestamps, hits) and leaderboard extensions
- ✅ Works with Mistral (via Groq) or any OpenRouter-compatible LLM
- ✅ Suited to enterprise knowledge assistants (e.g. Azure Docs)
- ✅ Suited to high-throughput RAG pipelines
- ✅ Suited to latency-sensitive LLM apps
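
One way to picture the metadata and leaderboard features is a small record kept next to each cached response. The sketch below is hypothetical: `CacheEntry`, `record_hit`, and `leaderboard` are illustrative names and fields, and the project may store its metadata differently.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CacheEntry:
    """Metadata kept alongside one cached query/response pair (assumed fields)."""
    query: str
    response: str
    created_at: float = field(default_factory=time.time)
    last_hit_at: Optional[float] = None
    hits: int = 0

def record_hit(entry, now=None):
    """Update the counters when a cached response is reused."""
    entry.hits += 1
    entry.last_hit_at = time.time() if now is None else now

def leaderboard(entries, top_n=3):
    """Return the most frequently reused entries, highest hit count first."""
    return sorted(entries, key=lambda e: e.hits, reverse=True)[:top_n]

entries = [
    CacheEntry("top places to visit in France", "Paris, Lyon, Nice..."),
    CacheEntry("how do I run the tests", "pytest tests/"),
]
record_hit(entries[1])
record_hit(entries[1])
record_hit(entries[0])
top = leaderboard(entries, top_n=1)
```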

# 🧱 Architecture Overview

```text
    ┌──────────────────────────────────┐
    │         User Query Input         │
    └──────────────────────────────────┘
                     │
                     ▼
    ┌──────────────────────────────────┐
    │ 1. Check Semantic Cache (FAISS)  │
    └──────────────────────────────────┘
        │ Yes (high match)        │ No (miss)
        ▼                         ▼
Reuse Cached LLM       ┌─────────────────────┐
    Response           │ 2. Retrieve Context │
                       └─────────────────────┘
                                  │
                                  ▼
               ┌─────────────────────────────────────┐
               │    3. Build Prompt + Inject Docs    │
               └─────────────────────────────────────┘
                                  │
                                  ▼
               ┌─────────────────────────────────────┐
               │ 4. Generate Response (Mistral LLM)  │
               └─────────────────────────────────────┘
                                  │
                                  ▼
               ┌─────────────────────────────────────┐
               │   5. Postprocess + Store in Cache   │
               └─────────────────────────────────────┘
```
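
The five steps in the diagram can be sketched as a single function. This is a minimal outline under stated assumptions: `answer_query` and the injected callables (`cache_get`, `cache_set`, `retrieve`, `build_prompt`, `generate`) are hypothetical placeholders for the modules in this repository, and the stubs at the bottom exist only to make the example runnable.

```python
def answer_query(query, cache_get, cache_set, retrieve, build_prompt, generate):
    """Return (response, from_cache) following the cache-first flow."""
    # 1. Check the semantic cache; a hit skips the rest of the pipeline.
    cached = cache_get(query)
    if cached is not None:
        return cached, True
    # 2. Retrieve context documents for the query.
    docs = retrieve(query)
    # 3. Build the prompt from the question and the retrieved documents.
    prompt = build_prompt(query, docs)
    # 4. Generate a response with the LLM.
    response = generate(prompt)
    # 5. Postprocess (here: trim whitespace) and store for future reuse.
    response = response.strip()
    cache_set(query, response)
    return response, False

# Stubs (assumptions): an exact-match dict stands in for the semantic cache,
# and a fake generator stands in for the LLM call.
store = {}
llm_calls = []

def fake_generate(prompt):
    llm_calls.append(prompt)
    return "  Paris, Lyon, Nice...  "

steps = dict(
    cache_get=store.get,
    cache_set=store.__setitem__,
    retrieve=lambda q: ["France travel guide excerpt"],
    build_prompt=lambda q, docs: "Context:\n" + "\n".join(docs) + "\n\nQuestion: " + q,
    generate=fake_generate,
)
first = answer_query("top places to visit in France", **steps)
second = answer_query("top places to visit in France", **steps)
```

In this run the second call is served from the cache, so the stubbed LLM is invoked only once.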

## 📁 Key Modules

| Module | Purpose |
|--------|---------|
| `semantic_cache/embedder.py` | Loads BGE model and returns query embeddings |
| `semantic_cache/index_manager.py` | Manages FAISS index creation, loading, saving |
| `semantic_cache/operations.py` | Handles get/set/clear cache operations |
| `rag/retriever.py` | Top-k document retrieval from Azure knowledge base |
| `rag/prompt_builder.py` | Combines retrieved chunks + user question into LLM prompt |
| `rag/llm_client.py` | Calls Mistral via Groq using LangChain |
| `rag/ingest_docs.py` | Preprocesses and uploads local docs into FAISS vectorstore |
| `tests/` | Unit tests for all core functionality |
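
As an illustration of the prompt-building step in the table, the sketch below joins retrieved chunks and the user question into one prompt string. The function signature, the `max_chunks` parameter, and the prompt wording are assumptions made for this example; the actual `rag/prompt_builder.py` may differ.

```python
def build_prompt(question, chunks, max_chunks=3):
    """Combine up to `max_chunks` retrieved chunks with the user question."""
    numbered = [
        f"[{i}] {chunk.strip()}"
        for i, chunk in enumerate(chunks[:max_chunks], start=1)
    ]
    context = "\n".join(numbered) if numbered else "(no context retrieved)"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )

prompt = build_prompt(
    "What is a resource group?",
    ["A resource group is a container for related resources. ", "Unrelated chunk."],
    max_chunks=1,
)
```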

## 🚀 Usage (Example)

```python
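from semantic_cache.operations import get_from_cache, set_in_cache

query = "top places to visit in France"

# Look up a previously answered query that is semantically similar.
cached = get_from_cache(query)

if cached:
    print("✅ Cache Hit:", cached)
else:
    # Cache miss: in the full pipeline this response would come from the RAG + LLM call.
    response = "Paris, Lyon, Nice..."
    set_in_cache(query, response)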
```

## Run Tests

```bash
pytest tests/
```

---
## 📌 Next Steps

🔁 Add leaderboard and TTL/size-based cache trimming

📚 Ingest Azure PDF documentation automatically

🌐 Wrap with FastAPI for API serving

☁️ Upgrade from FAISS → Qdrant/Chroma

🤖 Migrate from Groq to AI Foundry (multi-LLM orchestration)
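
The TTL and size-based trimming listed above is planned rather than implemented, so the following is only one possible shape for it. Everything here is an assumption: the `trim_cache` name, the entry layout (dicts with a `created_at` timestamp), and the policy of dropping expired entries first and then keeping the newest ones.

```python
def trim_cache(entries, now, ttl_seconds, max_size):
    """Drop entries older than the TTL, then keep at most `max_size` newest ones."""
    fresh = [e for e in entries if now - e["created_at"] <= ttl_seconds]
    fresh.sort(key=lambda e: e["created_at"], reverse=True)  # newest first
    return fresh[:max_size]

# Timestamps are plain numbers here to keep the example deterministic.
entries = [
    {"query": "a", "created_at": 0.0},
    {"query": "b", "created_at": 50.0},
    {"query": "c", "created_at": 90.0},
    {"query": "d", "created_at": 95.0},
]
kept = trim_cache(entries, now=100.0, ttl_seconds=60.0, max_size=2)
```

In a vector-backed cache, the embeddings of trimmed entries would also need to be removed from the index so that they can no longer be matched.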