{"id":22353464,"url":"https://github.com/microsoft/synthetic-rag-index","last_synced_at":"2025-06-20T00:03:29.959Z","repository":{"id":243222203,"uuid":"811798013","full_name":"microsoft/synthetic-rag-index","owner":"microsoft","description":"Service to import data from various sources and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. Hosted in Azure with serverless architecture.","archived":false,"fork":false,"pushed_at":"2024-10-11T17:47:54.000Z","size":143266,"stargazers_count":31,"open_issues_count":16,"forks_count":5,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-06-13T00:12:55.278Z","etag":null,"topics":["azure","document-analysis","few-shot-learning","large-language-model","llm","rag","retrieval-augmented-generation","serverless"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-07T10:20:25.000Z","updated_at":"2025-05-23T03:06:08.000Z","dependencies_parsed_at":"2024-07-15T20:17:59.798Z","dependency_job_id":null,"html_url":"https://github.com/microsoft/synthetic-rag-index","commit_stats":null,"previous_names":["clemlesne/rag-index","microsoft/synthetic-rag-index"],"tags_count":22,"template":false,"template_full_name":null,"purl":"pkg:github/microsoft/synthetic-rag-index","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fsynthetic-rag-index","tags_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fsynthetic-rag-index/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fsynthetic-rag-index/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fsynthetic-rag-index/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/synthetic-rag-index/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fsynthetic-rag-index/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260852085,"owners_count":23072586,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["azure","document-analysis","few-shot-learning","large-language-model","llm","rag","retrieval-augmented-generation","serverless"],"created_at":"2024-12-04T13:08:37.294Z","updated_at":"2025-06-20T00:03:24.919Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🧠 Synthetic RAG Index\n\nService to import data from various sources (e.g. PDF, images, Microsoft Office, HTML) and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. 
Hosted in Azure with serverless architecture.\n\n\u003c!-- github.com badges --\u003e\n[![Last release date](https://img.shields.io/github/release-date/clemlesne/synthetic-rag-index)](https://github.com/clemlesne/synthetic-rag-index/releases)\n[![Project license](https://img.shields.io/github/license/clemlesne/synthetic-rag-index)](https://github.com/clemlesne/synthetic-rag-index/blob/main/LICENSE)\n\n\u003c!-- GitHub Codespaces badge --\u003e\n[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/microsoft/synthetic-rag-index?quickstart=1)\n\n## Overview\n\nIn a real-world scenario, with a public corpus of 15M characters (222 PDFs, 7,330 pages), 2,940 facts were generated (8.41 MB indexed). That's a 93% reduction in the number of indexed documents compared to the chunk method (48,111 chunks, 300 characters each).\n\nIt includes principles taken from research papers:\n\n1. Repetition removal (\u003chttps://arxiv.org/abs/2112.11446\u003e)\n2. Corpus cleaning (\u003chttps://arxiv.org/abs/1910.10683\u003e)\n3. Synthetic data generation (\u003chttps://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1\u003e)\n\nThe functional workflow is as follows:\n\n```mermaid\n---\ntitle: Workflow\n---\ngraph LR\n  raw[(\"Raw\")]\n  sanitize[\"Sanitize\"]\n  extract[\"Extract\"]\n  chunck[\"Chunk\"]\n  synthesis[\"Synthesis\"]\n  page[\"Page\"]\n  fact[\"Fact\"]\n  critic[\"Critic\"]\n  index[(\"Index\")]\n\n  raw --\u003e sanitize\n  sanitize --\u003e extract\n  extract --\u003e chunck\n  chunck --\u003e synthesis\n  chunck --\u003e synthesis\n  synthesis --\u003e page\n  page --\u003e fact\n  page --\u003e fact\n  fact --\u003e critic\n  critic --\u003e index\n  critic --\u003e index\n```\n\n### Features\n\n\u003e [!NOTE]\n\u003e This project is a proof of concept. It is not intended to be used in production. 
This demonstrates how Azure serverless technologies and LLMs can be combined into a high-quality search engine for RAG scenarios.\n\n- [x] Costs nothing when not in use, thanks to the serverless architecture\n- [x] Data can be searched with semantic queries using AI Search\n- [x] Deduplicate content\n- [x] Extract text from PDF, images, Microsoft Office, HTML\n- [x] Garbage data detection\n- [x] Index files of more than 1,000 pages\n- [x] Remove redundant and irrelevant content by synthetic data generation\n\n### Format support\n\nDocument extraction is based on Azure Document Intelligence, specifically on the `prebuilt-layout` model. It [supports popular formats](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-layout?view=doc-intel-4.0.0\u0026tabs=sample-code#input-requirements).\n\nSome formats are first converted to PDF [with MuPDF](https://github.com/ArtifexSoftware/mupdf) to ensure compatibility with Document Intelligence.\n\n\u003e [!IMPORTANT]\n\u003e Formats not listed there are treated as binary and decoded with `UTF-8` encoding.\n\n| `Format` | **OCR** | **Details** |\n|-|-|-|\n| `.bmp` | ✅ | |\n| `.cbz` | ✅ | First converted to PDF with MuPDF. |\n| `.docx` | ✅ | |\n| `.epub` | ✅ | First converted to PDF with MuPDF. |\n| `.fb2` | ✅ | First converted to PDF with MuPDF. |\n| `.heif` | ✅ | |\n| `.html` | ✅ | |\n| `.jpg`, `.jpeg` | ✅ | |\n| `.mobi` | ✅ | First converted to PDF with MuPDF. |\n| `.pdf` | ✅ | Sanitized \u0026 compressed with MuPDF. |\n| `.png` | ✅ | |\n| `.pptx` | ✅ | |\n| `.svg` | ✅ | First converted to PDF with MuPDF. |\n| `.tiff` | ✅ | |\n| `.xlsx` | ✅ | |\n| `.xps` | ✅ | First converted to PDF with MuPDF. |\n\n### Demo\n\nAs an example, we take the [code_des_assurances_2024_1.pdf](examples/raw/code_des_assurances_2024_1.pdf) file.\n\nFirst, data is extracted from its binary format:\n\n```json\n{\n  \"created_at\": \"2024-06-08T19:17:51.229972Z\",\n  \"document_content\": \"Code des assurances\\n===\\n\\ndroit. 
org Institut Français d'Information Juridique\\n\\nDernière modification: 2024-01-01 Edition : 2024-01-19 2347 articles avec 5806 liens 57 références externes\\n\\nCe code ne contient que du droit positif français, les articles et éléments abrogés ne sont pas inclus. Il est recalculé au fur et à mesure des mises à jour. Pensez à actualiser votre copie régulièrement à partir de codes.droit.org.\\n\\nCes codes ont pour objectif de démontrer l'utilité de l'ouverture des données publiques juridiques tant législatives que jurisprudentielles. Il s'y ajoute une promotion du mouvement Open Science Juridique avec une incitation au dépôt du texte intégral en accès ouvert des articles de doctrine venant du monde professionnel (Grande Bibliothèque du Droit) et universitaire (HAL-CNRS).\\n\\nTraitements effectués à partir des données issues des APIs Legifrance et Judilibre. droit.org remercie les acteurs du Web qui autorisent des liens vers leur production : Dictionnaire du Droit Privé (réalisé par MM. Serge Braudo et Alexis Baumann), le Conseil constitutionnel, l'Assemblée Nationale, et le Sénat. [...]\",\n  \"file_path\": \"raw/code_des_assurances_2024_1.pdf\",\n  \"format\": \"markdown\",\n  \"langs\": [\"es\", \"la\", \"fr\", \"ja\", \"en\", \"it\", \"pt\", \"no\"],\n  \"title\": \"Code des assurances\\n===\"\n}\n```\n\nSecond, document is paged, and each page is synthesized to keep track of the context during all steps:\n\n```json\n{\n  \"synthesis\": \"The \\\"Code des assurances\\\" is structured into several legislative parts and chapters, each dealing with various aspects of insurance law and regulations in France. It covers a wide range of insurance-related subjects including the operation of insurance and reinsurance contracts, the requirements for companies, the obligations of insurers and insured, and the legal framework governing insurance practices. 
The document includes regulations about the constitution and operation of insurance entities, rules for granting administrative approvals, conditions for opening branches and operating under free provision of services, among others.\\n\\nSpecifically, it addresses the following:\\n1. The legislative basis for insurance contracts.\\n2. Detailed provisions on maritime, aerial, and space liability insurances.\\n3. Obligations for reporting and transparency in insurance practices.\\n4. Rules for life insurance and capitalizations applicable in specific French regions and territories.\\n5. Provisions for mandatory insurance types, like vehicle insurance, residence insurance, and insurance of construction work.\\n6. Specific rules and exceptions for departments like Bas-Rhin, Haut-Rhin, and Moselle and applicability in French overseas territories. [...]\"\n}\n\n```\n\nThird, multiple facts (=Q\u0026A pairs) are generated, and those are critiqued to keep only the most relevant ones:\n\n```json\n{\n  \"facts\": [\n    {\n      \"answer\": \"The 'Code des assurances' only contains active French law; abrogated articles and elements are not included.\",\n      \"context\": \"This exclusion ensures that the code remains up-to-date and relevant, reflecting the current legal landscape without outdated information.\",\n      \"question\": \"What elements are excluded from the 'Code des assurances'?\"\n    },\n    {\n      \"answer\": \"Insurance can be contracted for the policyholder, for another specified person, or for whomever it may concern.\",\n      \"context\": \"This flexibility allows insurance policies to be tailored to various scenarios, ensuring broad applicability and relevance to different stakeholders.\",\n      \"question\": \"For whom can insurance be contracted according to the document?\"\n    }\n  ]\n}\n\n```\n\nFinally, facts are individually indexed in AI Search:\n\n```json\n{\n  \"answer\": \"The 'Code des assurances' only contains active French law; 
abrogated articles and elements are not included.\",\n  \"context\": \"This exclusion ensures that the code remains up-to-date and relevant, reflecting the current legal landscape without outdated information.\",\n  \"document_synthesis\": \"The \\\"Code des assurances\\\" is structured into several legislative parts and chapters, each dealing with various aspects of insurance law and regulations in France. It covers a wide range of insurance-related subjects including the operation of insurance and reinsurance contracts, the requirements for companies, the obligations of insurers and insured, and the legal framework governing insurance practices. The document includes regulations about the constitution and operation of insurance entities, rules for granting administrative approvals, conditions for opening branches and operating under free provision of services, among others.\\n\\nSpecifically, it addresses the following:\\n1. The legislative basis for insurance contracts.\\n2. Detailed provisions on maritime, aerial, and space liability insurances.\\n3. Obligations for reporting and transparency in insurance practices.\\n4. Rules for life insurance and capitalizations applicable in specific French regions and territories.\\n5. Provisions for mandatory insurance types, like vehicle insurance, residence insurance, and insurance of construction work.\\n6. Specific rules and exceptions for departments like Bas-Rhin, Haut-Rhin, and Moselle and applicability in French overseas territories. 
[...]\",\n  \"file_path\": \"raw/code_des_assurances_2024_1.pdf\",\n  \"id\": \"93e5846ba121abf6ea3328a7ff5a96b60ab97ce2016166ac0384f2e61a963d6d\",\n  \"question\": \"What elements are excluded from the 'Code des assurances'?\"\n}\n```\n\n### High level architecture\n\n```mermaid\n---\ntitle: High level process\n---\ngraph LR\n  importer[\"Importer\"]\n  openai_ada[\"Ada\\n(OpenAI)\"]\n  search_index[\"Index\\n(AI Search)\"]\n  storage[(\"Blob\\n(Storage Account)\")]\n\n  importer -- Pull from --\u003e storage\n  importer -- Push to --\u003e search_index\n  search_index -. Generate embeddings .-\u003e openai_ada\n```\n\n### Component level architecture\n\n```mermaid\n---\ntitle: Importer component diagram (C4 model)\n---\ngraph LR\n  openai_ada[\"Ada\\n(OpenAI)\"]\n  search_index[\"Index\\n(AI Search)\"]\n  storage[(\"Blob\\n(Storage Account)\")]\n\n  subgraph importer[\"Importer\"]\n    document[\"Document extraction\\n(Document Intelligence)\"]\n    openai_gpt[\"GPT-4o\\n(OpenAI)\"]\n\n    func_chunck[\"Chunk\\n(Function App)\"]\n    func_critic[\"Critic\\n(Function App)\"]\n    func_extract[\"Extract\\n(Function App)\"]\n    func_fact[\"Fact\\n(Function App)\"]\n    func_index[\"Index\\n(Function App)\"]\n    func_page[\"Page\\n(Function App)\"]\n    func_sanitize[\"Sanitize\\n(Function App)\"]\n    func_synthesis[\"Synthesis\\n(Function App)\"]\n  end\n\n\n  func_sanitize -- Pull from --\u003e storage\n  func_sanitize -- Convert and linearize --\u003e func_sanitize\n  func_sanitize -- Push to --\u003e func_extract\n  func_extract -- Ask for extraction --\u003e document\n  func_extract -. 
Poll for result .-\u003e document\n  func_extract -- Push to --\u003e func_chunck\n  func_chunck -- Split into large parts --\u003e func_chunck\n  func_chunck -- Push to --\u003e func_synthesis\n  func_synthesis -- Create a chunk synthesis --\u003e openai_gpt\n  func_synthesis -- Push to --\u003e func_page\n  func_page -- Split into small parts --\u003e func_page\n  func_page -- Clean and filter repetitive content --\u003e func_page\n  func_page -- Push to --\u003e func_fact\n  func_fact -- Create Q/A pairs --\u003e openai_gpt\n  func_fact -- Push to --\u003e func_critic\n  func_critic -- Push to --\u003e func_index\n  func_critic -- Create a score for each fact --\u003e openai_gpt\n  func_critic -- Filter out irrelevant facts --\u003e func_critic\n  func_index -- Generate reproducible IDs --\u003e func_index\n  func_index -- Push to --\u003e search_index\n  search_index -. Generate embeddings .-\u003e openai_ada\n```\n\n### Usage cost\n\nBased on experiments, indexing costs around 29.15€ per 1k pages. 
Here is a detailed breakdown:\n\nScenario:\n\n- 7,330 pages (15M characters)\n- 222 PDFs (550.50 MB)\n- French (90%) and English (10%)\n\nOutcome:\n\n- 2,940 facts generated\n- 8.41 MB indexed on AI Search\n\nCost:\n\n| Service | Usage | Cost (abs) | Cost (per 1k pages) |\n|-|-|-|-|\n| **Azure AI Search** | Billed per hour | N/A | N/A |\n| **Azure Blob Storage** | N/A | N/A | N/A |\n| **Azure Document Intelligence** | 7,330 pages | 67.79€ | 9.25€ |\n| **Azure Functions** | N/A | N/A | N/A |\n| **Azure OpenAI GPT-4o** (in) | 23.79M tokens | 111.81€ | 15.25€ |\n| **Azure OpenAI GPT-4o** (out) | 2.45M tokens | 34.06€ | 4.65€ |\n| **Total** | | **213.66€** | **29.15€** |\n\n## Local installation\n\nSome prerequisites are needed to deploy the solution.\n\n[Prefer using GitHub Codespaces for a quick start.](https://codespaces.new/microsoft/synthetic-rag-index?quickstart=1) The environment will set up automatically with all the required tools.\n\nOn macOS, with [Homebrew](https://brew.sh), simply type `make brew`.\n\nFor other systems, make sure you have the following installed:\n\n- Bash compatible shell, like `bash` or `zsh`\n- Make, `apt install make` (Ubuntu), `yum install make` (CentOS), `brew install make` (macOS)\n- [Azure Functions Core Tools](https://github.com/Azure/azure-functions-core-tools?tab=readme-ov-file#installing)\n\nPlace a file called `config.yaml` in the root of the project with the following content:\n\n```yaml\n# config.yaml\nllm:\n  fast:\n    mode: azure_openai\n    azure_openai:\n      api_key: xxx\n      context: 16385\n      deployment: gpt-35-turbo-0125\n      endpoint: https://xxx.openai.azure.com\n      model: gpt-35-turbo\n      streaming: true\n  slow:\n    mode: azure_openai\n    azure_openai:\n      api_key: xxx\n      context: 128000\n      deployment: gpt-4o-2024-05-13\n      endpoint: https://xxx.openai.azure.com\n      model: gpt-4o\n      streaming: true\n\ndestination:\n  mode: ai_search\n  ai_search:\n    access_key: xxx\n    
endpoint: https://xxx.search.windows.net\n    index: trainings\n\ndocument_intelligence:\n  access_key: xxx\n  endpoint: https://xxx.cognitiveservices.azure.com\n```\n\nTo use a Service Principal to authenticate to Azure, you can also add the following to a `.env` file:\n\n```dotenv\nAZURE_CLIENT_ID=xxx\nAZURE_CLIENT_SECRET=xxx\nAZURE_TENANT_ID=xxx\n```\n\nTo override a specific configuration value, you can also use environment variables. For example, to override the `llm.fast.azure_openai.endpoint` value, you can use the `LLM__FAST__AZURE_OPENAI__ENDPOINT` variable:\n\n```dotenv\nLLM__FAST__AZURE_OPENAI__ENDPOINT=https://xxx.openai.azure.com\n```\n\nThen run:\n\n```bash\n# Install dependencies\nmake install\n```\n\nAI Search must also be configured with the following index:\n\n| **Field Name** | `Type` | Retrievable | Searchable | Dimensions | Vectorizer |\n|-|-|-|-|-|-|\n| **answer** | `Edm.String` | Yes | Yes | | |\n| **context** | `Edm.String` | Yes | Yes | | |\n| **created_at** | `Edm.String` | Yes | No | | |\n| **document_synthesis** | `Edm.String` | Yes | Yes | | |\n| **file_path** | `Edm.String` | Yes | No | | |\n| **id** | `Edm.String` | Yes | No | | |\n| **question** | `Edm.String` | Yes | Yes | | |\n| **vectors** | `Collection(Edm.Single)` | No | Yes | 1536 | *OpenAI ADA* |\n\n### Run\n\nFinally, run:\n\n```bash\n# Start the local API server\nmake dev\n```\n\n## Advanced usage\n\n### Configuration\n\nFeatures are documented in [features.py](helpers/config_models/features.py). 
All of them can be overridden in the `config.yaml` file:\n\n```yaml\n# config.yaml\nfeatures:\n  fact_iterations: 10\n  fact_score_threshold: 0.5\n  page_split_size: 2000\n\n[...]\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fsynthetic-rag-index","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fsynthetic-rag-index","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fsynthetic-rag-index/lists"}