awesome-pretrained-models-for-information-retrieval
A curated list of awesome papers related to pre-trained models for information retrieval (a.k.a., pretraining for IR).
https://github.com/ict-bigdatalab/awesome-pretrained-models-for-information-retrieval
LLM and IR
LLM for IR
- Generate, Filter, and Fuse: Query Expansion via Multi-Step Keyword Generation for Zero-Shot Neural Rankers. (**PaLM2-S for keyword generation**)
- Teaching language models to support answers with verified quotes.
- Evaluating Verifiability in Generative Search Engines.
- Large Search Model: Redefining Search Stack in the Era of LLMs.
- Improving Passage Retrieval with Zero-Shot Question Generation. (**UPR, reranks docs by query likelihood under GPT-neo 2.7B / T0 3B and 11B**)
- Promptagator: Few-shot Dense Retrieval From 8 Examples. (**few-shot in-context learning with FLAN 137B**)
- UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers. *Jon Saad-Falcon, Omar Khattab et al.* arXiv 2023. [[code](https://github.com/primeqa/primeqa)] (**trains a reranker on pseudo queries generated with GPT-3**)
- InPars: Data Augmentation for Information Retrieval using Large Language Models. (**uses GPT-3 Curie to generate pseudo queries via in-context learning; query-generation probabilities select the top-k query-document pairs**)
- InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. (**similar to InPars, but uses GPT-J 6B and a fine-tuned reranker as the selector**)
- InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers. (**GPT-J 6B and BLOOM 7B**)
- Generative Relevance Feedback with Large Language Models.
- Query Expansion by Prompting Large Language Models.
- Exploring the Viability of Synthetic Query Generation for Relevance Prediction. (**label-conditioned generation with a 137B LLM**)
- Large Language Model based Long-tail Query Rewriting in Taobao Search.
- Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval.
- Generate rather than Retrieve: Large Language Models are Strong Context Generators.
- Recitation-Augmented Language Models. (**RECITE, similar to GenRead**)
- Precise Zero-Shot Dense Retrieval without Relevance Labels.
- Query2doc: Query Expansion with Large Language Models. (**generates pseudo documents via in-context learning and concatenates them with the query; text-davinci-003**)
- Large Language Models are Strong Zero-Shot Retriever.
- Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts. (**ranking with synthetic data generated by ChatGPT**)
- Task-aware Retrieval with Instructions. (**T5-based**)
- One Embedder, Any Task: Instruction-Finetuned Text Embeddings. (**Instructor; 330 diverse tasks, 1.5B model**)
- ExaRanker: Explanation-Augmented Neural Ranker. (**trains monoT5 on both relevance scores and explanations generated by GPT-3.5 (text-davinci-002)**)
- Perspectives on Large Language Models for Relevance Judgment.
- Zero-Shot Listwise Document Reranking with a Large Language Model.
- Large Language Models are Built-in Autoregressive Search Engines. (**LLM-URL: GPT-3 text-davinci-003 generates URLs; model-based IR**)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent. (**zero-shot passage reranking with ChatGPT/GPT-4; see the listwise prompting sketch after this list**)
- Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting.
- RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models.
- Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models.
- Fine-Tuning LLaMA for Multi-Stage Text Retrieval.
- A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models.
- Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking.
- PaRaDe: Passage Ranking using Demonstrations with Large Language Models.
- Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels.
- Large Language Models can Accurately Predict Searcher Preferences.
- RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!
- Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models.
- ACID: Abstractive, Content-Based IDs for Document Retrieval with Language Models. (**GPT-3.5 generates keyphrases**)
- WebGPT: Browser-assisted question-answering with human feedback.
- Enabling Large Language Models to Generate Text with Citations. (**ALCE benchmark**)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation.
- Know Where to Go: Make LLM a Relevant, Responsible, and Trustworthy Searcher.
- Evaluating Generative Ad Hoc Information Retrieval.
- Retrieve Anything To Augment Large Language Models.
- Demonstrate–Search–Predict: Composing retrieval and language models for knowledge-intensive NLP.
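The listwise prompting pattern used by RankGPT-style rerankers above (e.g., *Is ChatGPT Good at Search?*, RankVicuna, RankZephyr) can be summarized with a short sketch. The `call_llm` function below is a hypothetical stand-in for whatever chat-completion backend is used; the prompt wording and the permutation-parsing fallback are illustrative, not taken from any of the papers.

```python
import re

def listwise_rerank(query, passages, call_llm, window=20):
    """Sketch of listwise LLM reranking: show numbered passages, ask for a permutation.

    `call_llm(prompt) -> str` is a placeholder, not a real library call.
    """
    candidates = passages[:window]
    numbered = "\n".join(f"[{i + 1}] {p[:300]}" for i, p in enumerate(candidates))
    prompt = (
        "Rank the passages below by relevance to the query.\n"
        f"Query: {query}\n{numbered}\n"
        "Answer only with a permutation such as [2] > [1] > [3]."
    )
    reply = call_llm(prompt)
    order = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", reply)]
    seen, ranking = set(), []
    for i in order:                      # keep valid, de-duplicated indices
        if 0 <= i < len(candidates) and i not in seen:
            seen.add(i)
            ranking.append(i)
    ranking += [i for i in range(len(candidates)) if i not in seen]  # append omissions
    return [candidates[i] for i in ranking] + passages[window:]
```

Sliding-window variants apply the same prompt repeatedly over overlapping windows so that more than `window` passages can be ranked.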
Perspectives or Surveys
- Information Retrieval meets Large Language Models: A strategic report from Chinese IR community.
- Large Language Models for Information Retrieval: A Survey.
Retrieval Augmented LLM
- Improving Language Models by Retrieving from Trillions of Tokens. (**RETRO, 7.5B encoder-decoder**)
- Atlas: Few-shot Learning with Retrieval Augmented Language Models.
- Internet-augmented language models through few-shot prompting for open-domain question answering.
- Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy.
- Instruction Tuning post Retrieval-Augmented Pretraining.
- Retrieval-augmented generation for knowledge-intensive NLP tasks. (**RAG; see the retrieve-then-read sketch after this list**)
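Most of the systems above share the same retrieve-then-generate skeleton. The sketch below assumes a hypothetical `retriever.search` method and `call_llm` function; it is a minimal illustration of the pattern, not the training setup of RETRO, Atlas, or RAG.

```python
def retrieve_then_read(question, retriever, call_llm, k=5):
    """Minimal retrieve-then-generate loop.

    `retriever.search(query, k)` is assumed to return [(doc_id, text, score), ...];
    `call_llm(prompt) -> str` is any text-generation backend. Both are placeholders.
    """
    hits = retriever.search(question, k)
    context = "\n\n".join(f"[{doc_id}] {text}" for doc_id, text, _ in hits)
    prompt = (
        "Answer the question using only the context below and cite the passage ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)
```

Iterative variants (e.g., retrieval-generation synergy) feed the draft answer back as a new query for another retrieval round.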
Multimodal Retrieval
Multi-stream Architecture Applied on Input
- M6-v0: Vision-and-Language Interaction for Multi-modal Pretraining. (**M6-v0/InterBERT**)
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. (**ViLBERT**)
- 12-in-1: Multi-Task Vision and Language Representation Learning. (**a multi-task model based on ViLBERT**)
- Learning Transferable Visual Models From Natural Language Supervision. (**CLIP; see the contrastive dual-encoder sketch after this list**)
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training.
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. (**ERNIE-ViL, 1st place on the VCR leaderboard**)
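CLIP and the other two-tower models in this section are trained with a symmetric contrastive objective over matched image-text pairs in a batch. A minimal PyTorch sketch of that loss, assuming the image and text encoders have already produced fixed-size embeddings:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    image_emb, text_emb: [batch, dim] tensors; pair i is the positive for row i.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature           # [batch, batch]
    labels = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; classify in both directions.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```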
Unified Single-stream Architecture
- XGPT: Cross-modal Generative Pre-Training for Image Captioning.
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. (**Unicoder-VL**)
- UNITER: UNiversal Image-TExt Representation Learning. *Yen-Chun Chen, Linjie Li et al.* ECCV 2020. [[code](https://github.com/ChenRocks/UNITER)] (**UNITER**)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks.
- VinVL: Making Visual Representations Matter in Vision-Language Models.
- Dynamic Modality Interaction Modeling for Image-Text Retrieval.
Survey Papers
- Pre-training Methods in Information Retrieval.
- Dense Text Retrieval based on Pretrained Language Models: A Survey.
- Pretrained Transformers for Text Ranking: BERT and Beyond.
- Semantic Models for the First-stage Retrieval: A Comprehensive Review.
- A Deep Look into Neural Ranking Models for Information Retrieval.
First Stage Retrieval
Sparse Retrieval
- Learning Term Discrimination. (**term re-weighting**)
- COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List.
- Learning Passage Impacts for Inverted Indexes.
- Document Expansion by Query Prediction. [[docTTTTTquery code](https://github.com/castorini/docTTTTTquery)] (**doc2query, docTTTTTquery; see the expansion sketch after this list**)
- Generation-Augmented Retrieval for Open-Domain Question Answering.
- Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation.
- SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval.
- Contextualized Sparse Representations for Real-Time Open-Domain Question Answering.
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.
- Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval. *Kyoung-Rok Jang et al.* EMNLP 2021. (**UHD**)
- Efficient Passage Retrieval with Hashing for Open-domain Question Answering. (**BPR, converts embedding vectors to binary codes**)
- Context-Aware Document Term Weighting for Ad-Hoc Search.
- Context-Aware Term Weighting For First Stage Passage Retrieval.
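The doc2query/docTTTTTquery entries above expand each document with model-predicted queries before building an ordinary inverted index. A rough sketch using Hugging Face `transformers`; the checkpoint name below is the commonly released docTTTTTquery model and the sampling settings are illustrative:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed public checkpoint; substitute your own query generator if needed.
MODEL = "castorini/doc2query-t5-base-msmarco"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def expand_document(doc_text, n_queries=5):
    """Append n_queries predicted queries to the document text before indexing (e.g., with BM25)."""
    inputs = tok(doc_text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(
        **inputs, max_length=64, do_sample=True, top_k=10, num_return_sequences=n_queries
    )
    queries = [tok.decode(o, skip_special_tokens=True) for o in outputs]
    return doc_text + " " + " ".join(queries)
```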
Dense Retrieval
- Dense Passage Retrieval for Open-Domain Question Answering. (**DPR, in-batch negatives; see the training-loss sketch after this list**)
- RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. (**RepBERT**)
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval.
- RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. (**cross-batch negatives, denoised hard negatives and data augmentation**)
- Optimizing Dense Retrieval Model Training with Hard Negatives. (**query-side fine-tuning built on a pretrained document encoder**)
- Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. (**TAS-Balanced, samples from query clusters and distills from a BERT ensemble**)
- PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval. (**PAIR**)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. (**ColBERT**)
- Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. (**Poly-encoders**)
- Sparse, Dense, and Attentional Representations for Text Retrieval. (**ME-BERT, multi-vectors**)
- Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval.
- Learning Dense Representations of Phrases at Scale.
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval.
- Multivariate Representation Learning for Information Retrieval.
- Distilling Knowledge from Reader to Retriever for Question Answering. (**distills the reader's cross-attention to the retriever**)
- Distilling Knowledge for Fast Retrieval-based Chat-bots. (**distills cross-encoders into bi-encoders**)
- Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. (**distills from a BERT ensemble**)
- Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. *Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin.* arXiv 2020. [[code](https://github.com/castorini/pyserini/blob/master/docs/experiments-tct_colbert.md)] (**TCT-ColBERT: distills from ColBERT**)
- Latent Retrieval for Weakly Supervised Open Domain Question Answering. (**ORQA, ICT**)
- Pre-training tasks for embedding-based large scale retrieval. *Wei-Cheng Chang et al.* ICLR 2020. (**ICT, BFS and WLP**)
- REALM: Retrieval-Augmented Language Model Pre-Training. (**REALM**)
- Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder. (**SEED-Encoder**)
- Condenser: a Pre-training Architecture for Dense Retrieval.
- Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval. (**multi-lingual pre-training**)
- Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval.
- LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval.
- A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval. (**contrastive pre-training of a discriminative autoencoder**)
- Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction. (**COSTA, group-wise contrastive learning**)
- Structure and Semantics Preserving Document Representations.
- Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning.
- Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation.
- Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index.
- Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance.
- Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval.
- Matching-oriented Embedding Quantization For Ad-hoc Retrieval.
- Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings.
- Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval.
- Multi-Task Retrieval for Knowledge-Intensive Tasks. (**multi-task learning**)
- Evaluating Extrapolation Performance of Dense Retrieval.
- Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval. (**ColBERT-PRF**)
- Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback. (**ANCE-PRF**)
- LoL: A Comparative Regularization Loss over Query Reformulation Losses for Pseudo-Relevance Feedback. (**pseudo-relevance feedback**)
- Implicit Feedback for Dense Passage Retrieval: A Counterfactual Approach. (**CoRocchio, counterfactual Rocchio algorithm**)
- Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models.
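The dual-encoder training recipe behind DPR and most of the follow-ups above boils down to a contrastive loss in which every other passage in the batch serves as a negative. A minimal PyTorch sketch; cross-batch negatives, hard-negative mining, and distillation are refinements on top of this:

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb, p_emb, temperature=1.0):
    """DPR-style loss: q_emb[i] should score highest against its own positive p_emb[i].

    q_emb: [batch, dim] query embeddings; p_emb: [batch, dim] positive-passage embeddings.
    """
    scores = q_emb @ p_emb.t() / temperature                  # [batch, batch] dot products
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)                    # positives on the diagonal
```

Adding explicit hard negatives amounts to concatenating their embeddings as extra columns of the score matrix while keeping the same diagonal labels.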
Hybrid Retrieval
- Complement Lexical Retrieval Model with Semantic Residual Embeddings. (**CLEAR; see the score-fusion sketch after this list**)
- Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval.
- Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index.
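A common way to combine a lexical ranker with a dense retriever, as in the hybrid systems above, is simple score interpolation over the union of both candidate lists. A small sketch; the min-max normalization and the weight `alpha` are illustrative choices, not taken from any specific paper:

```python
def fuse_scores(bm25_scores, dense_scores, alpha=0.5):
    """Interpolate min-max-normalized lexical and dense scores.

    Inputs are {doc_id: score} dicts; a document missing from one run contributes 0
    for that component. Returns doc ids sorted by fused score.
    """
    def normalize(run):
        lo, hi = min(run.values()), max(run.values())
        return {d: (s - lo) / (hi - lo + 1e-9) for d, s in run.items()}

    bm25, dense = normalize(bm25_scores), normalize(dense_scores)
    fused = {
        d: alpha * dense.get(d, 0.0) + (1.0 - alpha) * bm25.get(d, 0.0)
        for d in set(bm25) | set(dense)
    }
    return sorted(fused, key=fused.get, reverse=True)
```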
Re-ranking Stage
Basic Usage
- Understanding the Behaviors of BERT in Ranking. (**representation-focused vs. interaction-focused**)
- Passage Re-ranking with BERT. (**monoBERT, arguably the first work applying BERT to IR; see the cross-encoder sketch after this list**)
- Multi-Stage Document Ranking with BERT / [The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models.](https://arxiv.org/pdf/2101.05667.pdf) *Rodrigo Nogueira et al.* arXiv 2020. (**Expando-Mono-Duo: doc2query + pointwise + pairwise**)
- CEDR: Contextualized Embeddings for Document Ranking. (**CEDR: BERT + neural IR model**)
- Beyond [CLS] through Ranking by Generation.
- Document Ranking with a Pretrained Sequence-to-Sequence Model.
- RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses.
- Generalizing Discriminative Retrieval Models using Generative Tasks.
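The monoBERT-style rerankers above all score a concatenated (query, passage) pair with a cross-encoder and sort candidates by that score. A minimal sketch with `transformers`; the checkpoint is a publicly available MS MARCO cross-encoder used here only as an example:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"   # example checkpoint, swap in your own
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def rerank(query, passages, batch_size=32):
    """Score each (query, passage) pair and return passages sorted by relevance."""
    scores = []
    with torch.no_grad():
        for i in range(0, len(passages), batch_size):
            batch = passages[i:i + batch_size]
            enc = tok([query] * len(batch), batch, padding=True,
                      truncation=True, max_length=512, return_tensors="pt")
            scores.extend(model(**enc).logits.squeeze(-1).tolist())
    return sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
```

monoT5 follows the same pattern but frames relevance as generating a "true"/"false" token with a seq2seq model instead of a classification head.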
Long Document Processing Techniques
- Deeper Text Understanding for IR with Contextual Neural Language Modeling. (**BERT-MaxP, BERT-FirstP, BERT-SumP: passage-level evidence; see the aggregation sketch after this list**)
- Simple Applications of BERT for Ad Hoc Document Retrieval / [Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval.](https://www.aclweb.org/anthology/D19-1352.pdf) *Wei Yang, Haotian Zhang et al.* arXiv 2020; *Zeynep Akkalyoncu Yilmaz et al.* EMNLP 2019 short. [[code](https://github.com/castorini/birch)] (**Birch: sentence-level evidence**)
- Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking. (**distills a ranking model into conv-KNRM to select the top-k passages**)
- PARADE: Passage Representation Aggregation for Document Reranking.
- Local Self-Attention over Long Text for Efficient Document Retrieval. (**TKL: Transformer-Kernel for long text**)
- Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching. (**SMITH for doc2doc matching**)
- Socialformer: Social Network Inspired Long Document Modeling for Document Ranking.
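Because BERT-style rerankers accept only a few hundred tokens, the long-document papers above split each document into passages, score the pieces, and aggregate. A sketch of the MaxP strategy; `score_fn` is any passage-level reranker (for instance the cross-encoder sketch in the previous subsection), and the window sizes are illustrative:

```python
def score_long_document(query, doc_text, score_fn, window=225, stride=200):
    """BERT-MaxP-style scoring: split into overlapping word windows, take the max score.

    FirstP uses only the first window; SumP sums the window scores instead of taking the max.
    """
    words = doc_text.split()
    passages = [" ".join(words[i:i + window]) for i in range(0, max(len(words), 1), stride)]
    return max(score_fn(query, p) for p in passages)
```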
Improving Efficiency
- DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding. (**DC-BERT**)
- Efficient Document Re-Ranking for Transformers by Precomputing Term Representations. (**PreTTR**)
- Modularized Transformer-based Ranking Framework.
- Fast Forward Indexes for Efficient Document Ranking.
- Understanding BERT Rankers Under Distillation.
- Simplified TinyBERT: Knowledge Distillation for Document Retrieval. (**TinyBERT + knowledge distillation; see the distillation-loss sketch after this list**)
- Semi-Siamese Bi-encoder Neural Ranking Model Using Lightweight Fine-Tuning. (**lightweight fine-tuning**)
- Scattered or Connected? An Optimized Parameter-efficient Tuning Approach for Information Retrieval.
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection. (**Cascade Transformer: prunes candidates layer by layer**)
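Several of the efficiency papers above shrink a large reranker by distilling its relevance scores into a smaller model. The exact objective varies by paper; two common score-level choices are sketched below (plain MSE on scores, and a margin-MSE variant that matches the teacher's positive-negative margin). All inputs are score tensors produced by the student and a frozen teacher:

```python
import torch.nn.functional as F

def score_mse_loss(student_scores, teacher_scores):
    """Student regresses the teacher's relevance scores directly."""
    return F.mse_loss(student_scores, teacher_scores)

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Student matches the teacher's score margin between a positive and a negative passage."""
    return F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)
```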
Other Topics
- BERT-QE: Contextualized Query Expansion for Document Re-ranking. (**BERT-QE**)
- Training Curricula for Open Domain Answer Re-Ranking. (**curriculum learning based on BM25**)
- Not All Relevance Scores are Equal: Efficient Uncertainty and Calibration Modeling for Deep Retrieval Models.
- Selective Weak Supervision for Neural Information Retrieval.
- PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. (**PROP**)
- B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval. (**B-PROP**)
- Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need.
- Contrastive Learning of User Behavior Sequence for Context-Aware Document Ranking.
- Pre-trained Language Model based Ranking in Baidu Search.
- A Unified Pretraining Framework for Passage Ranking and Expansion.
- Axiomatically Regularized Pre-training for Ad hoc Search.
- PRADA: Practical Black-Box Adversarial Attacks against Neural Ranking Models.
- Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models.
- Are Neural Ranking Models Robust?
- Certified Robustness to Word Substitution Ranking Attack for Neural Ranking Models.
- Topic-oriented Adversarial Attacks against Black-box Neural Ranking Models. *Yu-An Liu et al.* SIGIR 2023.
- Cross-lingual Retrieval for Iterative Self-Supervised Training.
- CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. (**CLIRMatrix and multilingual BERT**)
Jointly Learning Retrieval and Re-ranking
Other Topics
- Adversarial Retriever-Ranker for dense text retrieval.
- RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking.
- RankFlow: Joint Optimization of Multi-Stage Cascade Ranking Systems as Flows.
Model-based IR System
Other Topics
- Rethinking Search: Making Domain Experts out of Dilettantes.
- Transformer Memory as a Differentiable Search Index. (**DSI; see the constrained-decoding sketch after this list**)
- DynamicRetriever: A Pre-training Model-based IR System with Neither Sparse nor Dense Index.
- A Neural Corpus Indexer for Document Retrieval.
- Autoregressive Search Engines: Generating Substrings as Document Identifiers.
- CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks. (**CorpusBrain**)
- A Unified Generative Retriever for Knowledge-Intensive Language Tasks via Prompt Learning. (**UGR**)
- TOME: A Two-stage Approach for Model-based Retrieval.
- How Does Generative Retrieval Scale to Millions of Passages?
- Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies. (**Semantic-Enhanced DSI**)
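The model-based (generative) retrieval systems above map a query directly to a document identifier with a seq2seq model and constrain decoding to identifiers that actually exist in the corpus. The sketch below uses an off-the-shelf `t5-base` checkpoint and a toy identifier space purely for illustration; DSI/NCI-style systems train their own models and use structured docids:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-base")          # placeholder; real systems fine-tune their own model
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
valid_docids = ["doc-001", "doc-002", "doc-137"]        # toy identifier space
docid_tokens = [tok(d, add_special_tokens=False).input_ids for d in valid_docids]

def allowed_tokens(batch_id, generated):
    """Only allow continuations that keep the output a prefix of some real docid."""
    prefix = generated.tolist()[1:]                      # drop the decoder start token
    nxt = {ids[len(prefix)] for ids in docid_tokens
           if len(ids) > len(prefix) and ids[:len(prefix)] == prefix}
    return list(nxt) or [tok.eos_token_id]

def retrieve(query, k=3):
    """Beam-search k docid strings for the query, restricted to valid identifiers."""
    enc = tok(query, return_tensors="pt")
    out = model.generate(**enc, num_beams=k, num_return_sequences=k,
                         max_length=16, prefix_allowed_tokens_fn=allowed_tokens)
    return [tok.decode(o, skip_special_tokens=True) for o in out]
```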
Other Resources
Other Resources About Pre-trained Models in NLP
Surveys About Efficient Transformers
Some Retrieval Toolkits