awesome-pretrained-models-for-information-retrieval

A curated list of awesome papers related to pre-trained models for information retrieval (a.k.a., pretraining for IR).
https://github.com/ict-bigdatalab/awesome-pretrained-models-for-information-retrieval

Last synced: about 14 hours ago
JSON representation

LLM and IR
- LLM for IR
  - Generate, Filter, and Fuse: Query Expansion via Multi-Step Keyword Generation for Zero-Shot Neural Rankers. - PaLM2-S for keywords generation**)
  - Teaching language models to support answers with verified quotes.
  - Evaluating Verifiability in Generative Search Engines. - liu/evaluating-verifiability-in-generative-search-engines)]
  - InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. - v2)](**silimar to InPars, use GPT-J 6B LLM, and a finetuned reranker as selector**)
  - Large Search Model: Redefining Search Stack in the Era of LLMs. - ->
  - Improving Passage Retrieval with Zero-Shot Question Generation. - passage-reranking)](**UPR, rerank docs based on query likelihood of GPT-neo 2.7B/T0 3B,11B**)
  - Promptagator: Few-shot Dense Retrieval From 8 Examples. - context learning, FLAN 137B**)
  - UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers. - Falcon, Omar Khattab et.al.* Arxiv 2023. [[code](https://github.com/primeqa/primeqa)](**Train reranker with generated pseudo quereis with GPT3**)
  - InPars: Data Augmentation for Information Retrieval using Large Language Models. - v1)](**Use GPT-3 Curie to generate pseudo quereis with in-context learning, query generation probs to select top-k q-d pairs**)
  - InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers. - J 6B and BLOOM 7B**)
  - Generative Relevance Feedback with Large Language Models.
  - Query Expansion by Prompting Large Language Models.
  - Exploring the Viability of Synthetic Query Generation for Relevance Prediction. - 137B label conditioned generation**)
  - Large Language Model based Long-tail Query Rewriting in Taobao Search.
  - Generate, Filter, and Fuse: Query Expansion via Multi-Step Keyword Generation for Zero-Shot Neural Rankers. - PaLM2-S for keywords generation**)
  - Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval.
  - Generate rather than Retrieve: Large Language Models are Strong Context Generators.
  - Recitation-Augmented Language Models. - Sun/RECITE)] (**similar to GenRead**)
  - Precise Zero-Shot Dense Retrieval without Relevance Labels.
  - Query2doc: Query Expansion with Large Language Models. - context learning and then concat with queries, text-davinci-003**)
  - Large Language Models are Strong Zero-Shot Retriever.
  - Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts. - askari/ChatGPT-RetrievalQA)] (**Ranking with synthetic data generated by ChatGPT**)
  - Task-aware Retrieval with Instructions. - T5**)
  - One Embedder, Any Task: Instruction-Finetuned Text Embeddings. - embedding)](**Intructor, 330 diverse tasks, 1.5B model**)
  - ExaRanker: Explanation-Augmented Neural Ranker. - dl/ExaRanker)] (**Training monoT5 with both relevance score and explanations generated by GPT-3.5 (text-davinci-002)**)
  - Perspectives on Large Language Models for Relevance Judgment.
  - Zero-Shot Listwise Document Reranking with a Large Language Model.
  - Large Language Models are Built-in Autoregressive Search Engines. - URL, use GPT-3 text-davinci-003 to generate URL, model-based IR**)
  - Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting.
  - RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models.
  - Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models. - sc)]
  - Fine-Tuning LLaMA for Multi-Stage Text Retrieval.
  - A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models.
  - Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking. - qlm)]
  - PaRaDe: Passage Ranking using Demonstrations with Large Language Models.
  - Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels.
  - Large Language Models can Accurately Predict Searcher Preferences.
  - RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!
  - Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models.
  - ACID: Abstractive, Content-Based IDs for Document Retrieval with Language Models. - 3.5 generate keyphrases**)
  - Teaching language models to support answers with verified quotes.
  - Evaluating Verifiability in Generative Search Engines. - liu/evaluating-verifiability-in-generative-search-engines)]
  - Enabling Large Language Models to Generate Text with Citations. - nlp/ALCE)] (**ALCE benchmark**)
  - FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation.
  - Know Where to Go: Make LLM a Relevant, Responsible, and Trustworthy Searcher.
  - Evaluating Generative Ad Hoc Information Retrieval.
  - Zero-Shot Listwise Document Reranking with a Large Language Model.
  - ACID: Abstractive, Content-Based IDs for Document Retrieval with Language Models. - 3.5 generate keyphrases**)
  - Retrieve Anything To Augment Large Language Models.
  - Large Search Model: Redefining Search Stack in the Era of LLMs. - ->
  - Improving Passage Retrieval with Zero-Shot Question Generation. - passage-reranking)](**UPR, rerank docs based on query likelihood of GPT-neo 2.7B/T0 3B,11B**)
  - Promptagator: Few-shot Dense Retrieval From 8 Examples. - context learning, FLAN 137B**)
  - UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers. - Falcon, Omar Khattab et.al.* Arxiv 2023. [[code](https://github.com/primeqa/primeqa)](**Train reranker with generated pseudo quereis with GPT3**)
  - InPars: Data Augmentation for Information Retrieval using Large Language Models. - v1)](**Use GPT-3 Curie to generate pseudo quereis with in-context learning, query generation probs to select top-k q-d pairs**)
  - InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. - v2)](**silimar to InPars, use GPT-J 6B LLM, and a finetuned reranker as selector**)
  - InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers. - J 6B and BLOOM 7B**)
  - Generative Relevance Feedback with Large Language Models.
  - Query Expansion by Prompting Large Language Models.
  - Exploring the Viability of Synthetic Query Generation for Relevance Prediction. - 137B label conditioned generation**)
  - Large Language Model based Long-tail Query Rewriting in Taobao Search.
  - Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval.
  - Generate rather than Retrieve: Large Language Models are Strong Context Generators.
  - Recitation-Augmented Language Models. - Sun/RECITE)] (**similar to GenRead**)
  - Precise Zero-Shot Dense Retrieval without Relevance Labels.
  - Fine-Tuning LLaMA for Multi-Stage Text Retrieval.
  - A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models.
  - Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking. - qlm)]
  - Query2doc: Query Expansion with Large Language Models. - context learning and then concat with queries, text-davinci-003**)
  - Large Language Models are Strong Zero-Shot Retriever.
  - Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts. - askari/ChatGPT-RetrievalQA)] (**Ranking with synthetic data generated by ChatGPT**)
  - ExaRanker: Explanation-Augmented Neural Ranker. - dl/ExaRanker)] (**Training monoT5 with both relevance score and explanations generated by GPT-3.5 (text-davinci-002)**)
  - Perspectives on Large Language Models for Relevance Judgment.
  - Large Language Models are Built-in Autoregressive Search Engines. - URL, use GPT-3 text-davinci-003 to generate URL, model-based IR**)
  - Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting.
  - RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models.
  - Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models. - sc)]
  - PaRaDe: Passage Ranking using Demonstrations with Large Language Models.
  - Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels.
  - Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models.
  - Enabling Large Language Models to Generate Text with Citations. - nlp/ALCE)] (**ALCE benchmark**)
  - FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation.
  - Know Where to Go: Make LLM a Relevant, Responsible, and Trustworthy Searcher.
  - Evaluating Generative Ad Hoc Information Retrieval.
  - WebGPT: Browser-assisted question-answering with human feedback.
  - Demonstrate–Search–Predict: Composing retrieval and language models for knowledge-intensive NLP.
  - Retrieve Anything To Augment Large Language Models.
  - Demonstrate–Search–Predict: Composing retrieval and language models for knowledge-intensive NLP.
  - Task-aware Retrieval with Instructions. - T5**)
  - One Embedder, Any Task: Instruction-Finetuned Text Embeddings. - embedding)](**Intructor, 330 diverse tasks, 1.5B model**)
  - Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent. - shot Passage reranking with ChatGPT/GPT4**)
  - Large Language Models can Accurately Predict Searcher Preferences.
  - RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!
- Perspectives or Surveys
- Retrieval Augmented LLM
Multimodal Retrieval
- Unified Single-stream Architecture
  - XGPT: Cross-modal Generative Pre-Training for Image Captioning.
  - Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. - VL**)
  - XGPT: Cross-modal Generative Pre-Training for Image Captioning.
  - UNITER: UNiversal Image-TExt Representation Learning. - Chun Chen, Linjie Li et.al.* ECCV 2020. [[code](https://github.com/ChenRocks/UNITER)] (**UNITER**)
  - Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks.
  - VinVL: Making Visual Representations Matter in Vision-Language Models.
  - Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks.
  - Dynamic Modality Interaction Modeling for Image-Text Retrieval.
  - VinVL: Making Visual Representations Matter in Vision-Language Models.
  - Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. - VL**)
  - UNITER: UNiversal Image-TExt Representation Learning. - Chun Chen, Linjie Li et.al.* ECCV 2020. [[code](https://github.com/ChenRocks/UNITER)] (**UNITER**)
- Multi-stream Architecture Applied on Input
  - ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. - multi-task)] (**VilBERT**)
  - 12-in-1: Multi-Task Vision and Language Representation Learning. - multi-task)] (**A multi-task model based on VilBERT**)
  - Learning Transferable Visual Models From Natural Language Supervision.
  - M6-v0: Vision-and-Language Interaction for Multi-modal Pretraining. - v0/InterBERT**)
  - M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training.
  - M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training.
  - ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. - vil)] (**ERNIE-ViL，1st place on the VCR leaderboard**)
  - Learning Transferable Visual Models From Natural Language Supervision.
  - ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. - vil)] (**ERNIE-ViL，1st place on the VCR leaderboard**)
  - ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. - multi-task)] (**VilBERT**)
  - M6-v0: Vision-and-Language Interaction for Multi-modal Pretraining. - v0/InterBERT**)
  - 12-in-1: Multi-Task Vision and Language Representation Learning. - multi-task)] (**A multi-task model based on VilBERT**)
Survey Papers
First Stage Retrieval
- Sparse Retrieval
  - Learning Term Discrimination. - reweighting**)
  - COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List.
  - Learning Passage Impacts for Inverted Indexes.
  - Document Expansion by Query Prediction. - dl/dl4ir-doc2query), [docTTTTTquery code](https://github.com/castorini/docTTTTTquery)] (**doc2query, docTTTTTquery**)
  - Generation-Augmented Retrieval for Open-Domain Question Answering.
  - Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation.
  - SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval.
  - Contextualized Sparse Representations for Real-Time Open-Domain Question Answering.
  - SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.
  - Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval. - Rok Jang et.al.* EMNLP 2021. (**UHD**)
  - Efficient Passage Retrieval with Hashing for Open-domain Question Answering. - ousia/bpr)] (**BPR, convert embedding vector to binary codes**)
  - Learning Term Discrimination. - reweighting**)
  - COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List.
  - Learning Passage Impacts for Inverted Indexes.
  - Document Expansion by Query Prediction. - dl/dl4ir-doc2query), [docTTTTTquery code](https://github.com/castorini/docTTTTTquery)] (**doc2query, docTTTTTquery**)
  - Generation-Augmented Retrieval for Open-Domain Question Answering.
  - Contextualized Sparse Representations for Real-Time Open-Domain Question Answering.
  - SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.
  - Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval. - Rok Jang et.al.* EMNLP 2021. (**UHD**)
  - Efficient Passage Retrieval with Hashing for Open-domain Question Answering. - ousia/bpr)] (**BPR, convert embedding vector to binary codes**)
  - Learning to Reweight Terms with Distributed Representations.
  - Context-Aware Document Term Weighting for Ad-Hoc Search.
  - Context-Aware Term Weighting For First Stage Passage Retrieval.
  - SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval.
- Dense Retrieval
  - RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. - Index)] (**RepBERT**)
  - Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval.
  - RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. - batch negatives, denoise hard negatives and data augementation**)
  - Optimizing Dense Retrieval Model Training with Hard Negatives. - side finetuning build on pretrained document encoders**)
  - Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. - hofstaetter/tas-balanced-dense-retrieval)] (**TAS-Balanced, sample from query cluster and distill from BERT ensemble**)
  - PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval - PAIR)] (**PAIR**)
  - ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. - futuredata/ColBERT)] (**ColBERT**)
  - Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. - encoders**)
  - Sparse, Dense, and Attentional Representations for Text Retrieval. - BERT, multi-vectors**)
  - Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval.
  - Learning Dense Representations of Phrases at Scale.
  - Multi-View Document Representation Learning for Open-Domain Dense Retrieval.
  - Multivariate Representation Learning for Information Retrieval.
  - Distilling Knowledge from Reader to Retriever for Question Answering. - retriever-pytorch)] (**Distill cross-attention of reader to retriever**)
  - Distilling Knowledge for Fast Retrieval-based Chat-bots. - encoders to bi-encoders**)
  - Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. - hofstaetter/neural-ranking-kd)] (**Distill from BERT ensemble**)
  - Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. - Chieh Lin, Jheng-Hong Yang, Jimmy Lin.* Arxiv 2020. [[code](https://github.com/castorini/pyserini/blob/master/docs/experiments-tct_colbert.md)] (**TCTColBERT: distill from ColBERT**)
  - Pre-training tasks for embedding-based large scale retrieval. - Cheng Chang et.al.* ICLR 2020. (**ICT, BFS and WLP**)
  - REALM: Retrieval-Augmented Language Model Pre-Training. - research/language/blob/master/language/realm/README.md)] (**REALM**)
  - Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder. - Encoder/)] (**Seed**)
  - Condenser: a Pre-training Architecture for Dense Retrieval.
  - Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval. - lingual pre-training**)
  - Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval.
  - LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval.
  - A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval. - based contrastive pretraining**)
  - Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction - Ma/COSTA)](**COSTA, group-wise contrastive learning**)
  - Structure and Semantics Preserving Document Representations.
  - Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning.
  - Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation.
  - Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index.
  - Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance.
  - Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval.
  - Matchingoriented Embedding Quantization For Ad-hoc Retrieval.
  - Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings.
  - Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval.
  - Multi-Task Retrieval for Knowledge-Intensive Tasks. - task learning**)
  - Evaluating Extrapolation Performance of Dense Retrieval. - eval)]
  - PseudoRelevance Feedback for Multiple Representation Dense Retrieval. - PRF**)
  - Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback. - PRF)] (**ANCE-PRF**)
  - LoL: A Comparative Regularization Loss over Query Reformulation Losses for Pseudo-Relevance Feedback. - relevance feedback**)
  - Implicit Feedback for Dense Passage Retrieval: A Counterfactual Approach. - DR)] (**CoRocchio, Counterfactual Rocchio algorithm**)
  - Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models.
  - RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. - Index)] (**RepBERT**)
  - Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval.
  - RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. - batch negatives, denoise hard negatives and data augementation**)
  - Optimizing Dense Retrieval Model Training with Hard Negatives. - side finetuning build on pretrained document encoders**)
  - Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. - hofstaetter/tas-balanced-dense-retrieval)] (**TAS-Balanced, sample from query cluster and distill from BERT ensemble**)
  - PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval - PAIR)] (**PAIR**)
  - Pre-training tasks for embedding-based large scale retrieval. - Cheng Chang et.al.* ICLR 2020. (**ICT, BFS and WLP**)
  - ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. - futuredata/ColBERT)] (**ColBERT**)
  - Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. - encoders**)
  - Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval.
  - Learning Dense Representations of Phrases at Scale.
  - Multi-View Document Representation Learning for Open-Domain Dense Retrieval.
  - Multivariate Representation Learning for Information Retrieval.
  - Distilling Knowledge from Reader to Retriever for Question Answering. - retriever-pytorch)] (**Distill cross-attention of reader to retriever**)
  - Distilling Knowledge for Fast Retrieval-based Chat-bots. - encoders to bi-encoders**)
  - Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. - hofstaetter/neural-ranking-kd)] (**Distill from BERT ensemble**)
  - Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. - Chieh Lin, Jheng-Hong Yang, Jimmy Lin.* Arxiv 2020. [[code](https://github.com/castorini/pyserini/blob/master/docs/experiments-tct_colbert.md)] (**TCTColBERT: distill from ColBERT**)
  - REALM: Retrieval-Augmented Language Model Pre-Training. - research/language/blob/master/language/realm/README.md)] (**REALM**)
  - Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder. - Encoder/)] (**Seed**)
  - Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval. - lingual pre-training**)
  - Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval.
  - LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval.
  - A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval. - based contrastive pretraining**)
  - Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction - Ma/COSTA)](**COSTA, group-wise contrastive learning**)
  - Structure and Semantics Preserving Document Representations.
  - Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning.
  - Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index.
  - Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval.
  - Matchingoriented Embedding Quantization For Ad-hoc Retrieval.
  - Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings.
  - Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval.
  - Evaluating Extrapolation Performance of Dense Retrieval. - eval)]
  - PseudoRelevance Feedback for Multiple Representation Dense Retrieval. - PRF**)
  - LoL: A Comparative Regularization Loss over Query Reformulation Losses for Pseudo-Relevance Feedback. - relevance feedback**)
  - Implicit Feedback for Dense Passage Retrieval: A Counterfactual Approach. - DR)] (**CoRocchio, Counterfactual Rocchio algorithm**)
  - Latent Retrieval for Weakly Supervised Open Domain Question Answering. - research/language/blob/master/language/orqa/README.md)] (**ORQA, ICT**)
  - Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval.
  - H-ERNIE: A Multi-Granularity Pre-Trained Language Model for Web Search. - ERNIE**)
  - Sparse, Dense, and Attentional Representations for Text Retrieval. - BERT, multi-vectors**)
  - Dense Passage Retrieval for Open-Domain Question Answering. - batch negatives**)
  - Condenser: a Pre-training Architecture for Dense Retrieval.
  - Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance.
  - Multi-Task Retrieval for Knowledge-Intensive Tasks. - task learning**)
  - Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback. - PRF)] (**ANCE-PRF**)
  - Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models.
- Hybrid Retrieval
Re-ranking Stage
- Basic Usage
  - Understanding the Behaviors of BERT in Ranking. - focused and Interanction-focused**)
  - Passage Re-ranking with BERT. - dl/dl4marco-bert)] (**monoBERT: Maybe the first work on applying BERT to IR**)
  - Multi-Stage Document Ranking with BERT, - Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models.](https://arxiv.org/pdf/2101.05667.pdf) *Rodrigo Nogueira et.al.* Arxiv 2020. (**Expando-Mono-Duo: doc2query+pointwise+pairwise**)
  - CEDR: Contextualized Embeddings for Document Ranking. - IR-Lab/cedr)] (**CEDR: BERT+neuIR model**)
  - Beyond [CLS
  - Document Ranking with a Pretrained Sequence-to-Sequence Model.
  - RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses.
  - Generalizing Discriminative Retrieval Models using Generative Tasks.
  - Understanding the Behaviors of BERT in Ranking. - focused and Interanction-focused**)
  - CEDR: Contextualized Embeddings for Document Ranking. - IR-Lab/cedr)] (**CEDR: BERT+neuIR model**)
  - Document Ranking with a Pretrained Sequence-to-Sequence Model.
  - RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses.
  - Passage Re-ranking with BERT. - dl/dl4marco-bert)] (**monoBERT: Maybe the first work on applying BERT to IR**)
  - Beyond [CLS
  - Multi-Stage Document Ranking with BERT, - Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models.](https://arxiv.org/pdf/2101.05667.pdf) *Rodrigo Nogueira et.al.* Arxiv 2020. (**Expando-Mono-Duo: doc2query+pointwise+pairwise**)
- Long Document Processing Techniques
  - Deeper Text Understanding for IR with Contextual Neural Language Modeling. - BERT-IR)] (**BERT-MaxP, BERT-firstP, BERT-sumP: Passage-level**)
  - Simple Applications of BERT for Ad Hoc Document Retrieval, - 3004.pdf) [Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval.](https://www.aclweb.org/anthology/D19-1352.pdf) *Wei Yang, Haotian Zhang et.al.* Arxiv 2020, *Zeynep Akkalyoncu Yilmaz et.al.* EMNLP 2019 short. [[code](https://github.com/castorini/birch)] (**Birch: Sentence-level**)
  - Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking. - hofstaetter/intra-document-cascade)] (**Distill a ranking model to conv-knrm to select top-k passages**)
  - PARADE: Passage Representation Aggregation for Document Reranking.
  - Local Self-Attention over Long Text for Efficient Document Retrieval. - hofstaetter/transformer-kernel-ranking)] (**TKL:Transformer-Kernel for long text**)
  - Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching. - research/google-research/tree/master/smith)] (**SMITH for doc2doc matching**)
  - Socialformer: Social Network Inspired Long Document Modeling for Document Ranking.
  - Deeper Text Understanding for IR with Contextual Neural Language Modeling. - BERT-IR)] (**BERT-MaxP, BERT-firstP, BERT-sumP: Passage-level**)
  - Simple Applications of BERT for Ad Hoc Document Retrieval, - 3004.pdf) [Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval.](https://www.aclweb.org/anthology/D19-1352.pdf) *Wei Yang, Haotian Zhang et.al.* Arxiv 2020, *Zeynep Akkalyoncu Yilmaz et.al.* EMNLP 2019 short. [[code](https://github.com/castorini/birch)] (**Birch: Sentence-level**)
  - Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking. - hofstaetter/intra-document-cascade)] (**Distill a ranking model to conv-knrm to select top-k passages**)
  - PARADE: Passage Representation Aggregation for Document Reranking.
  - Local Self-Attention over Long Text for Efficient Document Retrieval. - hofstaetter/transformer-kernel-ranking)] (**TKL:Transformer-Kernel for long text**)
  - Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching. - research/google-research/tree/master/smith)] (**SMITH for doc2doc matching**)
  - Socialformer: Social Network Inspired Long Document Modeling for Document Ranking.
  - Leveraging Passage-level Cumulative Gain for Document Ranking.
- Improving Efficiency
  - DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding. - BERT**)
  - Efficient Document Re-Ranking for Transformers by Precomputing Term Representations. - IR-Lab/prettr-neural-ir)] (**PreTTR**)
  - Modularized Transfomer-based Ranking Framework.
  - Fast Forward Indexes for Efficient Document Ranking.
  - Understanding BERT Rankers Under Distillation.
  - Simplified TinyBERT: Knowledge Distillation for Document Retrieval. - unique/Simplified-TinyBERT)] (**TinyBERT+knowledge distillation**)
  - Semi-Siamese Bi-encoder Neural Ranking Model Using Lightweight Fine-Tuning. - Tuning**)
  - Scattered or Connected? An Optimized Parameter-efficient Tuning Approach for Information Retrieval.
  - The Cascade Transformer: an Application for Efficient Answer Sentence Selection. - cascade-transformers)] (**Cascade Transformer: prune candidates by layer**)
  - Modularized Transfomer-based Ranking Framework.
  - Fast Forward Indexes for Efficient Document Ranking.
  - Understanding BERT Rankers Under Distillation.
  - DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding. - BERT**)
  - Efficient Document Re-Ranking for Transformers by Precomputing Term Representations. - IR-Lab/prettr-neural-ir)] (**PreTTR**)
  - Simplified TinyBERT: Knowledge Distillation for Document Retrieval. - unique/Simplified-TinyBERT)] (**TinyBERT+knowledge distillation**)
  - Semi-Siamese Bi-encoder Neural Ranking Model Using Lightweight Fine-Tuning. - Tuning**)
  - Scattered or Connected? An Optimized Parameter-efficient Tuning Approach for Information Retrieval.
  - The Cascade Transformer: an Application for Efficient Answer Sentence Selection. - cascade-transformers)] (**Cascade Transformer: prune candidates by layer**)
  - TILDE: Term Independent Likelihood moDEl for Passage Re-ranking.
  - Early Exiting BERT for Efficient Document Ranking. - monobert)] (**Early exit**)
- Other Topics
  - BERT-QE: Contextualized Query Expansion for Document Re-ranking. - zheng/BERT-QE)] (**BERT-QE**)
  - Training Curricula for Open Domain Answer Re-Ranking. - IR-Lab/curricula-neural-ir)] (**curriculum learning based on BM25**)
  - Not All Relevance Scores are Equal: Efficient Uncertainty and Calibration Modeling for Deep Retrieval Models.
  - Selective Weak Supervision for Neural Information Retrieval.
  - PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. - Ma/PROP)] (**PROP**)
  - B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval. - Ma/PROP)] (**B-PROP**)
  - Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need.
  - Contrastive Learning of User Behavior Sequence for Context-Aware Document Ranking.
  - Pre-trained Language Model based Ranking in Baidu Search.
  - A Unified Pretraining Framework for Passage Ranking and Expansion.
  - Axiomatically Regularized Pre-training for Ad hoc Search.
  - PRADA: Practical Black-Box Adversarial Attacks against Neural Ranking Models.
  - Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models
  - Are Neural Ranking Models Robust?
  - Certified Robustness to Word Substitution Ranking Attack for Neural Ranking Models
  - Topic-oriented Adversarial Attacks against Black-box Neural Ranking Models. - An Liu et.al.* SIGIR 2023.
  - Cross-lingual Retrieval for Iterative Self-Supervised Training.
  - CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. - CLIRMatrix and multilingual BERT**)
  - BERT-QE: Contextualized Query Expansion for Document Re-ranking. - zheng/BERT-QE)] (**BERT-QE**)
  - Not All Relevance Scores are Equal: Efficient Uncertainty and Calibration Modeling for Deep Retrieval Models.
  - Selective Weak Supervision for Neural Information Retrieval.
  - Contrastive Learning of User Behavior Sequence for Context-Aware Document Ranking.
  - Pre-trained Language Model based Ranking in Baidu Search.
  - Are Neural Ranking Models Robust?
  - Certified Robustness to Word Substitution Ranking Attack for Neural Ranking Models
  - Topic-oriented Adversarial Attacks against Black-box Neural Ranking Models. - An Liu et.al.* SIGIR 2023.
  - Cross-lingual Retrieval for Iterative Self-Supervised Training.
  - MarkedBERT: Integrating Traditional IR Cues in Pre-trained Language Models for Passage Retrieval.
  - Cross-lingual Language Model Pretraining for Retrieval.
  - Webformer: Pre-training with Web Pages for Information Retrieval.
  - Competitive Search.
  - PRADA: Practical Black-Box Adversarial Attacks against Neural Ranking Models.
  - Training Curricula for Open Domain Answer Re-Ranking. - IR-Lab/curricula-neural-ir)] (**curriculum learning based on BM25**)
  - CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. - CLIRMatrix and multilingual BERT**)
  - PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. - Ma/PROP)] (**PROP**)
  - B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval. - Ma/PROP)] (**B-PROP**)
  - Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models
Jointly Learning Retrieval and Re-ranking
- Other Topics
Model-based IR System
- Other Topics
Other Resources
- Other Resources About Pre-trained Models in NLP
- Surveys About Efficient Transformers
  - Efficient Transformers: A Survey.
  - Efficient Transformers: A Survey.
- Some Retrieval Toolkits
  - Faiss: a library for efficient similarity search and clustering of dense vectors
  - MatchZoo: a library consisting of many popular neural text matching models

Programming Languages

Categories

First Stage Retrieval 118 LLM and IR 106 Re-ranking Stage 87 Multimodal Retrieval 23 Model-based IR System 20 Survey Papers 8 Other Resources 7 Jointly Learning Retrieval and Re-ranking 5

Sub Categories

LLM for IR 92 Dense Retrieval 87 Other Topics 62 Sparse Retrieval 24 Improving Efficiency 20 Long Document Processing Techniques 15 Basic Usage 15 Multi-stream Architecture Applied on Input 12 Unified Single-stream Architecture 11 Retrieval Augmented LLM 10 Hybrid Retrieval 7 Perspectives or Surveys 4 Other Resources About Pre-trained Models in NLP 3 Some Retrieval Toolkits 2 Surveys About Efficient Transformers 2

Keywords

text-matching 1 text 1 neural-network 1 natural-language-processing 1 matching 1 deep-learning 1