{"id":49635529,"url":"https://github.com/renswickd/semantic-prompt-cache","last_synced_at":"2026-05-05T14:34:51.163Z","repository":{"id":298491031,"uuid":"998304351","full_name":"renswickd/semantic-prompt-cache","owner":"renswickd","description":"This app leverages Semantic Caching to minimize inference latency and reduce API costs by reusing semantically similar prompt responses.","archived":false,"fork":false,"pushed_at":"2025-07-04T10:07:20.000Z","size":33,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-04T11:24:09.398Z","etag":null,"topics":["mistral-api","optimization","rag","semantic-caching","ttl-cache"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/renswickd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-08T10:22:17.000Z","updated_at":"2025-07-04T10:07:24.000Z","dependencies_parsed_at":"2025-06-11T11:53:44.876Z","dependency_job_id":"63b03b61-f998-4d8e-8ec9-0916d5e284b3","html_url":"https://github.com/renswickd/semantic-prompt-cache","commit_stats":null,"previous_names":["renswickd/semantic-prompt-cache"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/renswickd/semantic-prompt-cache","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/renswickd%2Fsemantic-prompt-cache","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/renswickd%2Fsemantic-prompt-cache/tags","releases_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/renswickd%2Fsemantic-prompt-cache/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/renswickd%2Fsemantic-prompt-cache/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/renswickd","download_url":"https://codeload.github.com/renswickd/semantic-prompt-cache/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/renswickd%2Fsemantic-prompt-cache/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32653674,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-05T11:29:49.557Z","status":"ssl_error","status_checked_at":"2026-05-05T11:29:48.587Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["mistral-api","optimization","rag","semantic-caching","ttl-cache"],"created_at":"2026-05-05T14:34:50.351Z","updated_at":"2026-05-05T14:34:51.158Z","avatar_url":"https://github.com/renswickd.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RAG + Semantic Cache System\n\nThis project is designed to enhance a Retrieval-Augmented Generation (RAG) pipeline with a custom-built Semantic Cache system. 
The primary goal is to reduce redundant LLM (Large Language Model) calls, improve system responsiveness, and optimize cost for real-time and large-scale AI applications.\n\n## 🚀 Purpose\n\nIn traditional RAG pipelines, every user query is processed through document retrieval and LLM generation—even if a semantically similar query was already answered. This approach increases latency and inflates API usage costs.\n\nThis system introduces a semantic caching layer that intercepts incoming queries and compares them—based on meaning, not just keywords—against previously answered queries. If a sufficiently similar query is found, the cached response is reused, bypassing the need for another LLM call.\n\n## 🔧 Use Cases\n\n- **Chatbots with memory efficiency**  \n  Minimize repeated LLM calls for frequently asked or rephrased questions.\n\n- **Enterprise knowledge assistants**  \n  Provide consistent and faster answers to similar user queries across departments.\n\n- **High-throughput RAG pipelines**  \n  Scale to thousands of queries per day while maintaining performance and reducing cost.\n\n- **Latency-sensitive applications**  \n  Reduce end-user wait time by short-circuiting the full RAG flow when a cached response is available.\n\n\n# Semantic Cache for LLM-Enhanced RAG\n\nA modular, non-OOP semantic caching system built to reduce LLM calls and latency in Retrieval-Augmented Generation (RAG) pipelines.\n\n## 🔧 Features\n\n- ✅ Embeds user queries using `bge-small-en-v1.5`\n- ✅ Stores query-response pairs with FAISS index\n- ✅ Retrieves cached results based on semantic similarity\n- ✅ Configurable similarity threshold\n- ✅ Supports metadata (timestamps, hits) and leaderboard extensions\n- ✅ Fully functional with Mistral (via Groq) or any OpenRouter-compatible LLM\n- ✅ Enterprise knowledge assistants (e.g. 
Azure Docs)\n- ✅ High-throughput RAG pipelines\n- ✅ Latency-sensitive LLM apps\n\n# 🧱 Architecture Overview\n\n```text\n            ┌──────────────────────────────┐\n            │        User Query Input       │\n            └──────────────────────────────┘\n                         │\n                         ▼\n     ┌───────────────────────────────────────┐\n     │ 1. Check Semantic Cache (FAISS)       │\n     └───────────────────────────────────────┘\n         │ Yes (high match)   │ No (miss)\n         ▼                    ▼\n  Reuse Cached LLM     ┌─────────────────────┐\n      Response         │ 2. Retrieve Context │\n                       └─────────────────────┘\n                               │\n                               ▼\n         ┌────────────────────────────────┐\n         │ 3. Build Prompt + Inject Docs  │\n         └────────────────────────────────┘\n                               │\n                               ▼\n        ┌────────────────────────────────────┐\n        │ 4. Generate Response (Mistral LLM) │\n        └────────────────────────────────────┘\n                               │\n                               ▼\n        ┌────────────────────────────────────┐\n        │ 5. 
Postprocess + Store in Cache    │\n        └────────────────────────────────────┘\n```\n\n## 📁 Key Modules\n\n| Module | Purpose |\n|--------|---------|\n| `semantic_cache/embedder.py` | Loads BGE model and returns query embeddings |\n| `semantic_cache/index_manager.py` | Manages FAISS index creation, loading, saving |\n| `semantic_cache/operations.py` | Handles get/set/clear cache operations |\n| `rag/retriever.py` | Top-k document retrieval from Azure knowledge base |\n| `rag/prompt_builder.py` | Combines retrieved chunks + user question into LLM prompt |\n| `rag/llm_client.py` | Calls Mistral via Groq using LangChain |\n| `rag/ingest_docs.py` | Preprocesses and uploads local docs into FAISS vectorstore |\n| `tests/` | Unit tests for all core functionality |\n\n## 🚀 Usage (Example)\n\n```python\nfrom semantic_cache.operations import get_from_cache, set_in_cache\n\nquery = \"top places to visit in France\"\ncached = get_from_cache(query)\n\nif cached:\n    print(\"✅ Cache Hit:\", cached)\nelse:\n    # Placeholder: in a real pipeline this string comes from the RAG + LLM call\n    response = \"Paris, Lyon, Nice...\"\n    set_in_cache(query, response)\n```\n\n## Run Tests\n```bash\npytest tests/\n```\n\n\n---\n## 📌 Next Steps\n🔁 Add leaderboard and TTL/size-based cache trimming\n\n📚 Ingest Azure PDF documentation automatically\n\n🌐 Wrap with FastAPI for API serving\n\n☁️ Upgrade from FAISS → Qdrant/Chroma\n\n🤖 Migrate from Groq to AI Foundry (multi-LLM orchestration)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frenswickd%2Fsemantic-prompt-cache","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frenswickd%2Fsemantic-prompt-cache","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frenswickd%2Fsemantic-prompt-cache/lists"}