{"id":40842155,"url":"https://github.com/davidsbatista/haystack-retrieval","last_synced_at":"2026-01-21T23:06:54.359Z","repository":{"id":266285060,"uuid":"888589215","full_name":"davidsbatista/haystack-retrieval","owner":"davidsbatista","description":"Different retrieval techniques implemented in Haystack","archived":false,"fork":false,"pushed_at":"2025-04-24T11:56:42.000Z","size":22624,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-24T12:49:49.650Z","etag":null,"topics":["haystack-ai","information-retrieval","rag","retrieval-augmented-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davidsbatista.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-14T16:53:59.000Z","updated_at":"2025-04-24T11:56:45.000Z","dependencies_parsed_at":"2025-04-24T12:58:23.437Z","dependency_job_id":null,"html_url":"https://github.com/davidsbatista/haystack-retrieval","commit_stats":null,"previous_names":["davidsbatista/haystack-retrieval"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/davidsbatista/haystack-retrieval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidsbatista%2Fhaystack-retrieval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidsbatista%2Fhaystack-retrieval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidsbatista%2Fhaystack-retrieval/releases","m
anifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidsbatista%2Fhaystack-retrieval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/davidsbatista","download_url":"https://codeload.github.com/davidsbatista/haystack-retrieval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidsbatista%2Fhaystack-retrieval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28646718,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-21T21:29:11.980Z","status":"ssl_error","status_checked_at":"2026-01-21T21:24:31.872Z","response_time":86,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["haystack-ai","information-retrieval","rag","retrieval-augmented-generation"],"created_at":"2026-01-21T23:06:53.639Z","updated_at":"2026-01-21T23:06:54.352Z","avatar_url":"https://github.com/davidsbatista.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Retrieving with Haystack 2.x\n\nThis repository showcases different retrieval techniques within the context of a RAG-based QA system.\nThe retrieval techniques are implemented using the [Haystack 2.x](https://github.com/deepset-ai/haystack) library and \nevaluated using the [ARAGOG 
dataset](https://github.com/predlico/ARAGOG) with the Semantic Similarity metric.\n\n- [Presentation Slides](2025_04_PyConLit.pdf)\n\u003c!-- - [YouTube video recording](https://www.youtube.com/watch?v=_XntQls_j1A)--\u003e\n\n### Retrieval Techniques \n\n1. [Sentence Window Retrieval](#sentence-window-retrieval)\n2. [Auto-Merging Retrieval](#auto-merging-retrieval)\n3. [Maximum Marginal Relevance](#maximum-marginal-relevance)\n4. [Hybrid Search Retrieval](#hybrid-search-retrieval)\n5. [Multi-Query](#multi-query)\n6. [Hypothetical Document Embeddings - HyDE](#hypothetical-document-embeddings---hyde)\n7. [Document Summary Index](#document-summary-index)\n\n---\n\n## Sentence Window Retrieval\n\n\u003cimg src=\"images/sentence_window_retrieval.png\" width=\"80%\"\u003e\n\nSentence window retrieval is a technique that retrieves the context around relevant sentences.\nDuring indexing, documents are broken into smaller chunks or sentences and indexed. During retrieval, the sentences most relevant to a given query, based on a certain similarity metric, are retrieved.\nOnce we have the relevant sentences, we can retrieve neighboring sentences to provide full context. The number of neighboring sentences to retrieve is defined by a fixed number of sentences before and after the relevant sentence.\n\n## Auto-Merging Retrieval\n\n\u003cimg src=\"images/auto_merging_retrieval.png\" width=\"80%\"\u003e\n\nAuto-Merging is a retrieval technique that leverages a hierarchical document structure, in which we can think of the smaller \ndocuments as the children of the original document and the original document as the parent. This results in a \nhierarchical tree structure where each smaller document is a child of a previous larger document. 
The retrieval process\nstarts by retrieving the most relevant leaf nodes (smallest documents) and then deciding whether to return them or the \nparent depending on whether the number of matched leaf nodes below the same parent is above a certain threshold.\n\n## Maximum Marginal Relevance\n\n\u003cimg src=\"images/maximum_marginal_relevance.png\" width=\"85%\"\u003e\n\nMaximum Marginal Relevance ranks documents by selecting first those that are relevant to the query and dissimilar to those \nalready selected [[1](#1)]. This technique is used to re-rank the documents retrieved by the baseline RAG model.\nIt aims to balance the trade-off between relevance and diversity: a selected document should be \nrelevant to the user's query and have minimal similarity to previously selected documents.\n\n## Hybrid Search Retrieval\n\n\u003cimg src=\"images/hybird_search.png\" width=\"80%\"\u003e\n\nHybrid Search Retrieval combines multiple retrieval strategies. One example of hybrid search is combining keyword search \nwith semantic search, for instance, a BM25 retrieval and an embedding-based retrieval. One example of where hybrid\nsearch is useful can be seen in platforms like e-commerce websites. Users might either search for specific product names or features \n(handled well by keyword search) or describe what they are looking for in broader terms (better suited for semantic search). \n\n## Multi-Query\n \n\u003cimg src=\"images/multi_query.png\" width=\"80%\"\u003e\n\nMulti-query retrieves documents based on multiple queries generated from the original query by using synonyms, different\nword orders, or other transformations. This technique is used to retrieve documents that might not be retrieved by the\noriginal query but are relevant to the user's information need. Another form of multi-query breaks down \nthe original query into multiple sub-queries. 
This technique is useful when the original query is too broad or ambiguous, \nand the retrieval system can benefit from multiple interpretations of the query.\n\n## Hypothetical Document Embeddings - HyDE\n\n\u003cimg src=\"images/hyde.png\" width=\"80%\"\u003e\n\nGiven a query, Hypothetical Document Embeddings (HyDE) [[2](#2)] first zero-shot prompts an instruction-following language model \nto generate a “fake” hypothetical document that captures relevant textual patterns from the initial query - in practice, \nthis is done five times. Then, it encodes each hypothetical document into an embedding vector and averages them. The resulting \nsingle embedding can be used to identify a neighbourhood in the document embedding space from which similar actual \ndocuments are retrieved based on vector similarity.\n\n## Document Summary Index\n\n\u003cimg src=\"images/document_summary_indexing.png\" width=\"85%\"\u003e\n\nDocument Summary Index leverages document summaries for retrieval and uses full text documents for response generation [[3](#3)].\nIt is a two-step retrieval process. First, the document summaries are indexed. Then, the full text documents are split\ninto chunks and indexed. The document summaries are used to retrieve the full text documents. This technique aims to improve the retrieval\nperformance of the RAG model.\n\n# Summary\n\n\u003cimg src=\"images/summary.png\" width=\"90%\"\u003e\n\n# Experimental Results\n\nThe following table shows the semantic similarity of the answers retrieved by the different techniques over the [ARAGOG \ndataset](https://github.com/predlico/ARAGOG). 
The results are obtained by comparing the retrieved answers with the ground truth answers using the Semantic\nSimilarity metric.\n\n| Technique                                 | Semantic Answer Similarity |\n|-------------------------------------------|----------------------------|\n| Sentence-Window Retrieval                 | 0.700                      |\n| Auto-Merging Retrieval                    | 0.505                      |\n| Baseline RAG + Maximum Marginal Relevance | 0.670                      |\n| Hybrid Search                             | 0.699                      |\n| Multi-Query                               | 0.620                      |\n| Hypothetical Document Embeddings - HyDE   | 0.693                      |\n| Document Summary Index                    | 0.731                      |\n\n## References\n\n1. \u003ca name=\"1\"\u003e\u003c/a\u003e[The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries](https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf)\n2. \u003ca name=\"2\"\u003e\u003c/a\u003e[Hypothetical Document Embeddings - HyDE](https://aclanthology.org/2023.acl-long.99/)\n3. \u003ca name=\"3\"\u003e\u003c/a\u003e[A New Document Summary Index for LLM-Powered QA Systems](https://www.llamaindex.ai/blog/a-new-document-summary-index-for-llm-powered-qa-systems-9a32ece2f9ec)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidsbatista%2Fhaystack-retrieval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidsbatista%2Fhaystack-retrieval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidsbatista%2Fhaystack-retrieval/lists"}