awesome-rag-study

A curated collection of papers, frameworks, tools, and resources on Retrieval-Augmented Generation (RAG). Maintained for students of the Text Mining and Data Visualization course as a starting point for thesis research.
https://github.com/nluninja/awesome-rag-study


Advanced Techniques
- Chunking and Indexing
  - Unstructured - Pre-processing library for parsing PDFs, HTML, Word docs into clean chunks.
  - Semantic Chunking - Splitting documents based on semantic similarity rather than fixed token windows.
- Evaluation
  - RAGAS - Reference-free evaluation framework. Metrics: faithfulness, answer relevancy, context precision/recall.
  - TruLens - specific metrics.
  - DeepEval - aware metrics.
  - ARES
- Query Transformation
  - Step-Back Prompting - Ask a more abstract question first to retrieve broader context.
- Reranking
  - Cohere Rerank - Cross-encoder reranking API.
  - ColBERT - Late interaction model for efficient and effective reranking.
  - bge-reranker - Open-source cross-encoder reranker by BAAI.
  - RankLLM - Using LLMs themselves as rerankers via listwise prompting.
- Retrieval Strategies
  - Hypothetical Document Embeddings (HyDE) - Generate a hypothetical answer first, then use it as the retrieval query.
  - Contextual Retrieval (Anthropic's approach) - Prepend chunk-specific context to each chunk before embedding to reduce retrieval failures.
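
The semantic chunking entry above can be illustrated with a minimal sketch. It assumes a toy bag-of-words `embed` function standing in for a real embedding model, and a hypothetical `threshold` value chosen only for this example; neither comes from any library listed here. A new chunk starts whenever two adjacent sentences are less similar than the threshold.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive sentences; break when adjacent similarity drops below threshold."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)  # still on the same topic
        else:
            chunks.append([sentence])  # topic shift: start a new chunk
    return chunks


sentences = [
    "Dense retrievers embed queries and passages.",
    "Dense retrievers compare queries and passages by dot product.",
    "Pasta should be cooked in salted water.",
    "Salted water should be boiling before the pasta goes in.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

With these four sentences the sketch yields two chunks, one per topic. A real pipeline would swap `embed` for one of the embedding models listed under Embedding Models and tune the threshold on its own data.
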
Datasets and Benchmarks
- HotpotQA - Multi-hop QA requiring reasoning over multiple documents.
- MS MARCO - Large-scale passage retrieval and QA benchmark.
- BEIR - Zero-shot evaluation of retrieval models across diverse tasks.
- RGB (Retrieval-Augmented Generation Benchmark)
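
Retrieval benchmarks such as those above are usually scored with rank-based metrics; MS MARCO passage ranking reports MRR@10 and BEIR reports nDCG@10. As a rough illustration, the sketch below computes MRR@k and recall@k. The `runs` and `qrels` dictionaries and all document IDs are made-up toy data, not taken from any benchmark.

```python
def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """1/rank of the first relevant document within the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mrr_at_k(runs: dict[str, list[str]], qrels: dict[str, set[str]], k: int = 10) -> float:
    """Mean reciprocal rank over all queries that have relevance judgments."""
    total = sum(reciprocal_rank(runs.get(q, []), rel, k) for q, rel in qrels.items())
    return total / len(qrels)


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


# Toy data: system rankings per query, and the judged-relevant documents.
runs = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
qrels = {"q1": {"d1"}, "q2": {"d4", "d5"}}

mrr = mrr_at_k(runs, qrels, k=10)
recall_q2 = recall_at_k(runs["q2"], qrels["q2"], k=3)
print(mrr, recall_q2)
```
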
Embedding Models
- text-embedding-3-small/large - General-purpose embeddings. The `large` variant has 3072 dimensions.
- Cohere Embed v3
- BGE (BAAI) - Open-source. Top-performing embeddings, available in multiple sizes.
- E5-Mistral - Open-source. LLM-based embedding model with strong performance on the MTEB benchmark.
- Nomic Embed - Open-source. Long-context (8192 tokens); fully open-source, including the training data.
- Jina Embeddings - Open-source. Multilingual, supports an 8192-token context. Good for non-English corpora.
- MTEB Leaderboard - Up-to-date embedding model benchmarks.
Foundational Papers
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks - The original RAG paper by Lewis et al. (Meta AI). Introduces the RAG architecture combining a pre-trained seq2seq model with a dense retriever (DPR).
- Dense Passage Retrieval for Open-Domain Question Answering - DPR — the dense retrieval method that underpins many RAG systems.
- REALM: Retrieval-Augmented Language Model Pre-Training - Pre-trains a language model jointly with a knowledge retriever.
- Attention Is All You Need - The Transformer architecture — foundational to all modern LLMs used in RAG.
Frameworks and Libraries
- LangChain
- LlamaIndex
- Haystack - Production-ready NLP framework by deepset. Pipeline-based architecture.
- RAGFlow - Open-source RAG engine with deep document understanding and chunk visualization.
- Verba - Open-source RAG chatbot powered by Weaviate. Good for quick prototyping.
- Cognita - Open-source modular RAG framework for production use.
Survey Papers
- Retrieval-Augmented Generation for Large Language Models: A Survey - Comprehensive survey covering Naive RAG, Advanced RAG, and Modular RAG paradigms. Excellent starting point.
- A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models - Covers the evolution of RA-LLMs, taxonomies, and training strategies.
- Seven Failure Points When Engineering a RAG System - Practical guide to what can go wrong in RAG pipelines — highly recommended for thesis work.
Tutorials and Guides
- RAG From Scratch (LangChain) - Series of notebooks covering RAG concepts from basics to advanced patterns.
- Building RAG Applications with LlamaIndex - Official LlamaIndex documentation and conceptual guide.
- Pinecone RAG Learning Center - Well-written introduction to RAG with practical examples.
- Full Stack RAG App Tutorial (freeCodeCamp) - Video walkthrough of building a complete RAG application.
Vector Databases
- Chroma
- Weaviate - Self-hosted / Cloud. Supports hybrid search natively. GraphQL API.
- Qdrant - Self-hosted / Cloud. Written in Rust. Excellent filtering and payload support.
- Milvus - Self-hosted / Cloud. Highly scalable. Used in many production deployments.
- Pinecone - Cloud-only. Fully managed. Simple API. Popular in industry.
- FAISS
- pgvector
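
Hybrid search, mentioned above, combines a keyword ranking with a vector-similarity ranking. One common way to merge the two is reciprocal rank fusion (RRF). The sketch below is a generic illustration, not the API or the internal method of any database in this list. The function name, the document IDs, and the constant `k=60` (a conventional default for RRF) are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from two retrievers.
keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from BM25
vector_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from embedding similarity

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists (`doc_a`, `doc_c`) rise to the top, while documents found by only one retriever are kept but ranked lower.
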
Videos and Talks
- But what is RAG? (3Blue1Brown-style explainer) - Visual, intuitive explanation of how RAG works.
- RAG is Dead? Long Live RAG! (Keynote) - Discussion on the future of RAG vs. long-context models.
- Building Production RAG (AI Engineer Summit) - Practical lessons from deploying RAG at scale.
- Advanced RAG Techniques (DeepLearning.AI) - Short course by Andrew Ng's platform.
