awesome-rag-study
A curated collection of papers, frameworks, tools, and resources on Retrieval-Augmented Generation (RAG). Maintained for students of the Text Mining and Data Visualization course as a starting point for thesis research.
https://github.com/nluninja/awesome-rag-study
-
Advanced Techniques
-
Chunking and Indexing
- Unstructured - Pre-processing library for parsing PDFs, HTML, Word docs into clean chunks.
- Semantic Chunking - Splitting documents based on semantic similarity rather than fixed token windows.
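As a rough illustration of semantic chunking, the sketch below starts a new chunk whenever the similarity between consecutive sentences drops below a threshold. The `toy_embed` function (bag-of-words counts) and the `threshold` value are assumptions made so the example runs without a model; a real pipeline would plug in an actual sentence-embedding model.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.2, embed=toy_embed):
    """Group consecutive sentences; break where similarity falls below threshold."""
    chunks, current, prev_vec = [], [], None
    for sentence in sentences:
        vec = embed(sentence)
        if current and cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = [
    "Vector databases store embeddings.",
    "Vector databases index embeddings for search.",
    "Bananas are yellow.",
    "Bananas are rich in potassium.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

With fixed token windows the topic shift between the second and third sentence could land mid-chunk; here the boundary follows the drop in similarity instead.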
-
Query Transformation
- Step-Back Prompting - Ask a more abstract question first to retrieve broader context.
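A minimal sketch of the step-back idea, assuming two hypothetical callables: `generate` (any LLM call that maps a prompt to text) and `retrieve` (any retriever that maps a query and `k` to a list of passages). The prompt wording and the toy stand-ins are illustrative only.

```python
def step_back_retrieve(question, generate, retrieve, k=3):
    """Retrieve for a more abstract 'step-back' question, then for the original."""
    prompt = (
        "Rewrite the question as a more general question about the "
        f"underlying concept.\nQuestion: {question}\nGeneral question:"
    )
    abstract_question = generate(prompt).strip()
    passages = []
    # Broader context first, then question-specific passages, without duplicates.
    for query in (abstract_question, question):
        for passage in retrieve(query, k):
            if passage not in passages:
                passages.append(passage)
    return abstract_question, passages


# Toy stand-ins (assumptions) so the sketch runs without any external service.
def fake_generate(prompt):
    return "What factors determine embedding quality? "


def fake_retrieve(query, k):
    corpus = {
        "What factors determine embedding quality?": [
            "Training data matters.",
            "Model size matters.",
        ],
        "Why does my 384-dim model miss synonyms?": [
            "Model size matters.",
            "Small models compress meaning.",
        ],
    }
    return corpus.get(query, [])[:k]


abstract_q, context = step_back_retrieve(
    "Why does my 384-dim model miss synonyms?", fake_generate, fake_retrieve
)
print(abstract_q)
print(context)
```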
-
Reranking
- Cohere Rerank - Cross-encoder reranking API.
- ColBERT - Late interaction model for efficient and effective reranking.
- bge-reranker - Open-source cross-encoder reranker by BAAI.
- RankLLM - Using LLMs themselves as rerankers via listwise prompting.
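To make "late interaction" concrete, the sketch below implements the MaxSim scoring rule used by ColBERT-style models: each query token vector is matched to its most similar document token vector, and the maxima are summed. The two-dimensional token vectors are made-up toy values; a real model produces them with a trained encoder.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rerank(query_vecs, candidates):
    """Sort (doc_id, doc_vecs) candidates by MaxSim score, best first."""
    return sorted(
        candidates, key=lambda item: maxsim_score(query_vecs, item[1]), reverse=True
    )


# Toy per-token embeddings (assumed values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("doc_b", [[1.0, 0.0], [0.6, 0.0]]),              # matches only the first query token
    ("doc_a", [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]),  # matches both query tokens
]
ranked_ids = [doc_id for doc_id, _ in rerank(query, candidates)]
print(ranked_ids)
```

A cross-encoder such as Cohere Rerank or bge-reranker instead scores each query-document pair jointly in one forward pass, which is typically more accurate per pair but cannot reuse precomputed document token vectors.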
-
Retrieval Strategies
- Hypothetical Document Embeddings (HyDE) - Generate a hypothetical answer first, then use it as the retrieval query.
- Contextual Retrieval - Anthropic's approach: prepend chunk-specific context to each chunk before embedding to reduce retrieval failures.
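Both strategies above can be sketched in a few lines. This is an illustrative outline, not either project's implementation: `generate` is a hypothetical LLM callable, `toy_embed` is a bag-of-words stand-in for a real embedding model, and the prompt strings are assumptions.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_search(question, corpus, generate, embed=toy_embed, k=1):
    """HyDE: embed a generated hypothetical answer instead of the raw question."""
    hypothetical = generate(f"Write a short passage that answers: {question}")
    query_vec = embed(hypothetical)
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:k]


def contextualize(chunk, document, generate):
    """Contextual retrieval: prepend LLM-written context to a chunk before embedding."""
    context = generate(
        f"Document:\n{document}\n\nChunk:\n{chunk}\n\n"
        "In one sentence, situate this chunk within the document."
    )
    return f"{context.strip()}\n\n{chunk}"


corpus = [
    "Paris is the capital of France.",
    "The mitochondria is the powerhouse of the cell.",
]
top = hyde_search(
    "Which city hosts the French government?",
    corpus,
    generate=lambda prompt: "The French government sits in Paris, the capital of France.",
)
enriched = contextualize(
    "Revenue grew 3% over the previous quarter.",
    "ACME quarterly report ...",
    generate=lambda prompt: "From the revenue section of ACME's quarterly report. ",
)
print(top)
print(enriched)
```

The question itself shares almost no words with the matching passage; the hypothetical answer does, which is the gap HyDE is meant to close. The enriched chunk is what would be embedded and indexed in place of the bare chunk.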
-
-
Datasets and Benchmarks
- HotpotQA - Multi-hop QA requiring reasoning over multiple documents.
- MS MARCO - Large-scale passage retrieval and QA benchmark.
- BEIR - Zero-shot evaluation of retrieval models across diverse tasks.
- RGB (Retrieval-Augmented Generation Benchmark)
-
-
Embedding Models
- text-embedding-3-small/large - General-purpose embeddings. The `large` variant has 3072 dimensions.
- Cohere Embed v3
- BGE (BAAI) - Open-source. Top-performing open-source embeddings. Available in multiple sizes.
- E5-Mistral - Open-source. LLM-based embedding model. Strong performance on the MTEB benchmark.
- Nomic Embed - Open-source. Long-context (8192 tokens), fully open-source with open training data.
- Jina Embeddings - Open-source. Multilingual, supports an 8192-token context. Good for non-English corpora.
- MTEB Leaderboard - Up-to-date embedding model benchmarks.
-
-
Foundational Papers
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks - The original RAG paper by Lewis et al. (Meta AI). Introduces the RAG architecture combining a pre-trained seq2seq model with a dense retriever (DPR).
- Dense Passage Retrieval for Open-Domain Question Answering - DPR — the dense retrieval method that underpins many RAG systems.
- REALM: Retrieval-Augmented Language Model Pre-Training - Pre-trains a language model jointly with a knowledge retriever.
- Attention Is All You Need - The Transformer architecture — foundational to all modern LLMs used in RAG.
-
Frameworks and Libraries
- LangChain
- LlamaIndex
- Haystack - Production-ready NLP framework by deepset. Pipeline-based architecture.
- RAGFlow - Open-source RAG engine with deep document understanding and chunk visualization.
- Verba - Open-source RAG chatbot powered by Weaviate. Good for quick prototyping.
- Cognita - Open-source modular RAG framework for production use.
-
-
Survey Papers
- Retrieval-Augmented Generation for Large Language Models: A Survey - Comprehensive survey covering Naive RAG, Advanced RAG, and Modular RAG paradigms. Excellent starting point.
- A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models - Covers the evolution of RA-LLMs, taxonomies, and training strategies.
- Seven Failure Points When Engineering a RAG System - Practical guide to what can go wrong in RAG pipelines — highly recommended for thesis work.
-
Tutorials and Guides
- RAG From Scratch (LangChain) - Series of notebooks covering RAG concepts from basics to advanced patterns.
- Building RAG Applications with LlamaIndex - Official LlamaIndex documentation and conceptual guide.
- Pinecone RAG Learning Center - Well-written introduction to RAG with practical examples.
- Full Stack RAG App Tutorial (freeCodeCamp) - Video walkthrough of building a complete RAG application.
-
-
Vector Databases
- Chroma
- Weaviate - Self-hosted / Cloud. Supports hybrid search natively. GraphQL API.
- Qdrant - Self-hosted / Cloud. Written in Rust. Excellent filtering and payload support.
- Milvus - Self-hosted / Cloud. Highly scalable. Used in many production deployments.
- Pinecone - Cloud only. Fully managed. Simple API. Popular in industry.
- FAISS
- pgvector
-
-
Videos and Talks
- But what is RAG? (3Blue1Brown-style explainer) - Visual, intuitive explanation of how RAG works.
- RAG is Dead? Long Live RAG! (Keynote) - Discussion on the future of RAG vs. long-context models.
- Building Production RAG (AI Engineer Summit) - Practical lessons from deploying RAG at scale.
- Advanced RAG Techniques (DeepLearning.AI) - Short course by Andrew Ng's platform.
-