https://github.com/ict-bigdatalab/awesome-pretrained-models-for-information-retrieval
A curated list of awesome papers related to pre-trained models for information retrieval (a.k.a., pretraining for IR).
- Host: GitHub
- URL: https://github.com/ict-bigdatalab/awesome-pretrained-models-for-information-retrieval
- Owner: ict-bigdatalab
- Created: 2020-11-25T02:53:47.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2024-01-07T09:06:09.000Z (about 1 year ago)
- Last Synced: 2024-11-07T23:02:42.731Z (2 months ago)
- Topics: bert-for-ir, dense-retrieval, information-retrieval, pretrain-for-search, pretrained-language-models, pretraining-for-ir, reranking, web-search
- Homepage:
- Size: 437 KB
- Stars: 643
- Watchers: 21
- Forks: 48
- Open Issues: 0
# awesome-pretrained-models-for-information-retrieval
> A curated list of awesome papers related to pre-trained models for information retrieval (a.k.a., **pre-training for IR**). If any papers are missing, feel free to open a PR to include them. Feedback and contributions are welcome!
## Pre-training for IR
- [Survey Papers](#survey-papers)
- [Phase 1: First-stage Retrieval](#first-stage-retrieval)
  - Sparse Retrieval
    - [Neural term re-weighting](#neural-term-re-weighting)
    - [Query or document expansion](#query-or-document-expansion)
    - [Sparse representation learning](#sparse-representation-learning)
  - Dense Retrieval
    - [Hard negative sampling](#hard-negative-sampling)
    - [Late interaction and multi-vector representation](#late-interaction-and-multi-vector-representation)
    - [Knowledge distillation](#knowledge-distillation)
    - [Pre-training tailored for dense retrieval](#pre-training-tailored-for-dense-retrieval)
    - [Jointly learning retrieval and indexing](#jointly-learning-retrieval-and-indexing)
    - [Domain adaptation](#domain-adaptation)
    - [Query reformulation](#query-reformulation)
    - [Bias](#bias)
- [Phase 2: Re-ranking Stage](#re-ranking-stage)
  - Basic Usage
    - [Discriminative ranking models](#discriminative-ranking-models)
    - [Generative ranking models](#generative-ranking-models)
    - [Hybrid ranking models](#hybrid-ranking-models)
  - Long Document Processing Techniques
    - [Passage score aggregation](#passage-score-aggregation)
    - [Passage representation aggregation](#passage-representation-aggregation)
    - [Designing new architectures](#designing-new-architectures)
  - Improving Efficiency
    - [Decoupling the interaction](#decoupling-the-interaction)
    - [Knowledge distillation](#knowledge-distillation)
    - [Partial Fine-tuning](#partial-fine-tuning)
    - [Early exit](#early-exit)
  - Other Topics
    - [Query Expansion](#query-expansion)
    - [Re-weighting Training Samples](#re-weighting-training-samples)
    - [Pre-training Tailored for Re-ranking](#pre-training-tailored-for-re-ranking)
    - [Adversarial Attack and Defence](#adversarial-attack-and-defence)
    - [Cross-lingual Retrieval](#cross-lingual-retrieval)
- [Jointly Learning Retrieval and Re-ranking](#jointly-learning-retrieval-and-re-ranking)
- [Model-based IR System](#model-based-ir-system)
- [LLM and IR](#llm-and-ir)
Retrieval Augmented LLM
- [Perspectives or Surveys](#perspectives-or-surveys)
- [Synthetic Query Generation](#synthetic-query-generation)
- [Synthetic Document Generation](#synthetic-document-generation)
- [LLM for Relevance Scoring](#llm-for-relevance-scoring)
- [Text Generation based on IR](#text-generation-based-on-ir)
- [Others](#others)
- [Multimodal Retrieval](#multimodal-retrieval)
Unified Single-stream Architecture
Multi-stream Architecture Applied on Input
- [Other Resources](#other-resources)
## Survey Papers
- [Pre-training Methods in Information Retrieval.](https://arxiv.org/pdf/2111.13853.pdf) *Yixing Fan, Xiaohui Xie et.al.* FnTIR 2022
- [Dense Text Retrieval based on Pretrained Language Models: A Survey.](https://arxiv.org/pdf/2211.14876.pdf) *Wayne Xin Zhao, Jing Liu et.al.* Arxiv 2022
- [Pretrained Transformers for Text Ranking: BERT and Beyond.](https://arxiv.org/abs/2010.06467) *Jimmy Lin et.al.* M&C 2021
- [Semantic Models for the First-stage Retrieval: A Comprehensive Review.](https://arxiv.org/pdf/2103.04831.pdf) *Jiafeng Guo et.al.* TOIS 2021
- [A Deep Look into neural ranking models for information retrieval.](https://arxiv.org/abs/1903.06902) *Jiafeng Guo et.al.* IPM 2020
## First Stage Retrieval
### Sparse Retrieval
#### Neural term re-weighting
- [Learning to Reweight Terms with Distributed Representations.](https://dl.acm.org/doi/pdf/10.1145/2766462.2767700) *Guoqing Zheng, Jamie Callan* SIGIR 2015.(**DeepTR**)
- [Context-Aware Term Weighting For First Stage Passage Retrieval.](https://dl.acm.org/doi/pdf/10.1145/3397271.3401204) *Zhuyun Dai et.al.* SIGIR 2020 short. [[code](https://github.com/AdeDZY/DeepCT)] (**DeepCT**)
- [Context-Aware Document Term Weighting for Ad-Hoc Search.](https://dl.acm.org/doi/pdf/10.1145/3366423.3380258) *Zhuyun Dai et.al.* WWW 2020. [[code](https://github.com/AdeDZY/DeepCT/tree/master/HDCT)] (**HDCT**)
- [Learning Term Discrimination.](https://arxiv.org/pdf/2004.11759.pdf) *Jibril Frej et.al.* SIGIR 2020. (**IDF-reweighting**)
- [COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List.](https://arxiv.org/pdf/2104.07186.pdf) *Luyu Gao et.al.* NAACL 2021. [[code](https://github.com/luyug/COIL)] (**COIL**)
- [Learning Passage Impacts for Inverted Indexes.](https://arxiv.org/pdf/2104.12016.pdf) *Antonio Mallia et.al.* SIGIR 2021 short. [[code](https://github.com/DI4IR/SIGIR2021)] (**DeepImpact**)
#### Query or document expansion
- [Document Expansion by Query Prediction.](https://arxiv.org/pdf/1904.08375.pdf) *Rodrigo Nogueira et.al.* [[doc2query code](https://github.com/nyu-dl/dl4ir-doc2query), [docTTTTTquery code](https://github.com/castorini/docTTTTTquery)] (**doc2query, docTTTTTquery**)
- [Generation-Augmented Retrieval for Open-Domain Question Answering.](https://arxiv.org/pdf/2009.08553.pdf) *Yuning Mao et.al.* ACL 2021. [[code](https://github.com/morningmoni/GAR)] (**query expansion with BART**)
- [Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation.](https://arxiv.org/abs/2105.00666) *Jeong et.al.* arXiv 2021. [[code](https://github.com/starsuzi/UDEG)] (**unsupervised document expansion**)
#### Sparse representation learning
- [SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval.](https://arxiv.org/pdf/2010.00768.pdf) *Yang Bai, Xiaoguang Li et.al.* Arxiv 2020. (**SparTerm: Term importance distribution from MLM+Binary Term Gating**)
- [Contextualized Sparse Representations for Real-Time Open-Domain Question Answering.](https://arxiv.org/pdf/1911.02896.pdf) *Jinhyuk Lee, Minjoon Seo et.al.* ACL 2020. [[code](https://github.com/jhyuklee/sparc)] (**SPARC, sparse vectors**)
- [SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.](https://arxiv.org/pdf/2107.05720.pdf), and [v2.](https://arxiv.org/pdf/2109.10086.pdf) *Thibault Formal et.al.* SIGIR 2021. [[code](https://github.com/naver/splade)](**SPLADE**)
- [Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval.](https://arxiv.org/pdf/2104.07198.pdf) *Kyoung-Rok Jang et.al.* EMNLP 2021. (**UHD**)
- [Efficient Passage Retrieval with Hashing for Open-domain Question Answering.](https://arxiv.org/pdf/2106.00882.pdf) *Ikuya Yamada et.al.* ACL 2021. [[code](https://github.com/studio-ousia/bpr)] (**BPR, convert embedding vector to binary codes**)
### Dense Retrieval
#### Hard negative sampling
- [Dense Passage Retrieval for Open-Domain Question Answering.](https://arxiv.org/pdf/2004.04906.pdf) *Vladimir Karpukhin,Barlas Oguz et.al.* EMNLP 2020 [[code](https://github.com/facebookresearch/DPR)] (**DPR, in-batch negatives**)
- [RepBERT: Contextualized Text Embeddings for First-Stage Retrieval.](https://arxiv.org/pdf/2006.15498.pdf) *Jingtao Zhan et.al.* Arxiv 2020. [[code](https://github.com/jingtaozhan/RepBERT-Index)] (**RepBERT**)
- [Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval.](https://arxiv.org/pdf/2007.00808.pdf) *Lee Xiong, Chenyan Xiong et.al.* [[code](https://github.com/microsoft/ANCE)] (**ANCE, refresh index during training**)
- [RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering.](https://arxiv.org/pdf/2010.08191.pdf) *Yingqi Qu et.al.* NAACL 2021. (**RocketQA: cross-batch negatives, denoise hard negatives and data augmentation**)
- [Optimizing Dense Retrieval Model Training with Hard Negatives.](https://arxiv.org/pdf/2104.08051.pdf) *Jingtao Zhan et.al.* SIGIR 2021. [[code](https://github.com/jingtaozhan/DRhard)] (**ADORE&STAR, query-side finetuning built on pretrained document encoders**)
- [Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling.](https://arxiv.org/pdf/2104.06967.pdf) *Sebastian Hofstätter et.al.* SIGIR 2021.[[code](https://github.com/sebastian-hofstaetter/tas-balanced-dense-retrieval)] (**TAS-Balanced, sample from query cluster and distill from BERT ensemble**)
- [PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval.](https://arxiv.org/pdf/2108.06027.pdf) *Ruiyang Ren et.al.* ACL Findings 2021. [[code](https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2021-PAIR)] (**PAIR**)
#### Late interaction and multi-vector representation
- [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.](https://arxiv.org/pdf/2004.12832.pdf) *Omar Khattab et.al.* SIGIR 2020. [[code](https://github.com/stanford-futuredata/ColBERT)] (**ColBERT**)
- [Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring.](https://arxiv.org/pdf/1905.01969.pdf) *Samuel Humeau,Kurt Shuster et.al.* ICLR 2020. [[code](https://github.com/facebookresearch/ParlAI/tree/master/projects/polyencoder)] (**Poly-encoders**)
- [Sparse, Dense, and Attentional Representations for Text Retrieval.](https://arxiv.org/pdf/2005.00181.pdf) *Yi Luan, Jacob Eisenstein et.al.* TACL 2020. (**ME-BERT, multi-vectors**)
- [Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval.](https://arxiv.org/pdf/2105.03599.pdf) *Hongyin Tang, Xingwu Sun et.al.* ACL 2021.
- [Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index.](https://arxiv.org/pdf/1906.05807.pdf) *Minjoon Seo,Jinhyuk Lee et.al.* ACL 2019. [[code](https://github.com/uwnlp/denspi)] (**DENSPI**)
- [Learning Dense Representations of Phrases at Scale.](https://arxiv.org/pdf/2012.12624.pdf) *Jinhyuk Lee, Danqi Chen et.al.* ACL 2021. [[code](https://github.com/jhyuklee/DensePhrases)] (**DensePhrases**)
- [Multi-View Document Representation Learning for Open-Domain Dense Retrieval.](https://arxiv.org/pdf/2203.08372.pdf) *Shunyu Zhang et.al.* ACL 2022. (**MVR**)
- [Multivariate Representation Learning for Information Retrieval.](https://arxiv.org/pdf/2304.14522.pdf) *Hamed Zamani et.al.* SIGIR 2023. (**Learn multivariate distributions**)
#### Knowledge distillation
- [Distilling Knowledge from Reader to Retriever for Question Answering.](https://arxiv.org/pdf/2012.04584.pdf) *Gautier Izacard, Edouard Grave.* ICLR 2021. [[unofficial code](https://github.com/lucidrains/distilled-retriever-pytorch)] (**Distill cross-attention of reader to retriever**)
- [Distilling Knowledge for Fast Retrieval-based Chat-bots.](https://arxiv.org/pdf/2004.11045.pdf) *Amir Vakili Tahami et.al.* SIGIR 2020. [[code](https://github.com/KamyarGhajar/DistilledNeuralResponseRanker)] (**Distill from cross-encoders to bi-encoders**)
- [Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation.](https://arxiv.org/pdf/2010.02666.pdf) *Sebastian Hofstätter et.al.* Arxiv 2020. [[code](https://github.com/sebastian-hofstaetter/neural-ranking-kd)] (**Distill from BERT ensemble**)
- [Distilling Dense Representations for Ranking using Tightly-Coupled Teachers.](https://arxiv.org/pdf/2010.11386.pdf) *Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin.* Arxiv 2020. [[code](https://github.com/castorini/pyserini/blob/master/docs/experiments-tct_colbert.md)] (**TCTColBERT: distill from ColBERT**)
- [Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling.](https://arxiv.org/pdf/2104.06967.pdf) *Sebastian Hofstätter et.al.* SIGIR 2021.[[code](https://github.com/sebastian-hofstaetter/tas-balanced-dense-retrieval)] (**TAS-Balanced, sample from query cluster and distill from BERT ensemble**)
- [RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking.](https://arxiv.org/pdf/2110.07367.pdf) *Ruiyang Ren, Yingqi Qu et.al.* EMNLP 2021. [[code](https://github.com/PaddlePaddle/RocketQA)] (**RocketQAv2, joint learning by distillation**)
- [Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval.](https://dl.acm.org/doi/pdf/10.1145/3477495.3531961) *Kelong Mao et.al.* SIGIR 2022.
#### Pre-training tailored for dense retrieval
- [Latent Retrieval for Weakly Supervised Open Domain Question Answering.](https://arxiv.org/pdf/1906.00300.pdf) *Kenton Lee et.al.* ACL 2019. [[code](https://github.com/google-research/language/blob/master/language/orqa/README.md)] (**ORQA, ICT**)
- [Pre-training tasks for embedding-based large scale retrieval.](https://arxiv.org/pdf/2002.03932.pdf) *Wei-Cheng Chang et.al.* ICLR 2020. (**ICT, BFS and WLP**)
- [REALM: Retrieval-Augmented Language Model Pre-Training.](https://arxiv.org/pdf/2002.08909.pdf) *Kelvin Guu, Kenton Lee et.al.* ICML 2020. [[code](https://github.com/google-research/language/blob/master/language/realm/README.md)] (**REALM**)
- [Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder.](https://arxiv.org/pdf/2102.09206.pdf) *Shuqi Lu, Di He, Chenyan Xiong et.al.* EMNLP 2021. [[code](https://github.com/microsoft/SEED-Encoder/)] (**Seed**)
- [Condenser: a Pre-training Architecture for Dense Retrieval.](https://arxiv.org/pdf/2104.08253.pdf) *Luyu Gao et.al.* EMNLP 2021. [[code](https://github.com/luyug/Condenser)](**Condenser**)
- [Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval.](https://arxiv.org/pdf/2206.03281.pdf) *Ning Wu et.al.* IJCAI 2022. [[code](https://github.com/wuning0929/CCP_IJCAI22)] (**CCP, cross-lingual pre-training**)
- [Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval.](https://arxiv.org/pdf/2108.05540.pdf) *Luyu Gao et.al.* ACL 2022. [[code](https://github.com/luyug/Condenser)](**coCondenser**)
- [LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval.](https://arxiv.org/pdf/2203.06169.pdf) *Canwen Xu, Daya Guo et.al.* ACL 2022. [[code](https://github.com/JetRunner/LaPraDoR)] (**LaPraDoR, ICT+dropout**)
- [A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval.](https://arxiv.org/pdf/2208.09846.pdf) *Xinyu Ma et.al.* CIKM 2022. (**CPADE, document term distribution-based contrastive pretraining**)
- [Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction](https://arxiv.org/pdf/2204.10641.pdf) *Xinyu Ma et.al.* SIGIR 2022. [[code](https://github.com/Albert-Ma/COSTA)](**COSTA, group-wise contrastive learning**)
- [H-ERNIE: A Multi-Granularity Pre-Trained Language Model for Web Search.](https://dl.acm.org/doi/pdf/10.1145/3477495.3531986) *Xiaokai Chu et.al.* SIGIR 2022. (**H-ERNIE**)
- [Structure and Semantics Preserving Document Representations.](https://arxiv.org/pdf/2201.03720.pdf) *Natraj Raman et.al.* SIGIR 2022.
- [Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning.](https://arxiv.org/pdf/2112.09118.pdf) *Gautier Izacard et.al.* TMLR 2022. [[code](https://github.com/facebookresearch/contriever)] (**Contriever**)
- [Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation.](https://arxiv.org/abs/2203.07735) *Jeong et.al.* ACL 2022. [[code](https://github.com/starsuzi/DAR)] (**Augmentation for Dense Retrieval**)
#### Jointly learning retrieval and indexing
- [Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index.](https://arxiv.org/pdf/2105.03933.pdf) *Han Zhang et.al.* SIGIR 2021 short. [[code](https://github.com/jdcomsearch/poeem)] (**Poeem**)
- [Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance.](https://arxiv.org/pdf/2108.00644.pdf) *Jingtao Zhan et.al.* CIKM 2021. [[code](https://github.com/jingtaozhan/JPQ)] (**JPQ**)
- [Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval.](https://arxiv.org/pdf/2110.05789.pdf) *Jingtao Zhan et.al.* WSDM 2022. [[code](https://github.com/jingtaozhan/RepCONC)] (**RepCONC**)
- [Matching-oriented Embedding Quantization For Ad-hoc Retrieval.](https://arxiv.org/pdf/2104.07858.pdf) *Shitao Xiao et.al.* EMNLP 2021. [[code](https://github.com/microsoft/MoPQ)]
- [Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings.](https://arxiv.org/pdf/2204.00185.pdf) *Shitao Xiao et.al.* SIGIR 2022. [[code](https://github.com/staoxiao/LibVQ)]
#### Multi-hop dense retrieval
- [Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval.](https://arxiv.org/pdf/2009.12756.pdf) *Wenhan Xiong, Xiang Lorraine Li et.al.* ICLR 2021 [[code](https://github.com/facebookresearch/multihop_dense_retrieval)] (**Iteratively encode the question and previously retrieved documents as query vectors**)
#### Domain adaptation
- [Multi-Task Retrieval for Knowledge-Intensive Tasks.](https://arxiv.org/pdf/2101.00117.pdf) *Jean Maillard, Vladimir Karpukhin et.al.* ACL 2021. (**Multi-task learning**)
- [Evaluating Extrapolation Performance of Dense Retrieval.](https://arxiv.org/pdf/2204.11447.pdf) *Jingtao Zhan et.al.* CIKM 2022. [[code](https://github.com/jingtaozhan/extrapolate-eval)]
#### Query reformulation
- [Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval.](https://arxiv.org/pdf/2106.11251.pdf) *Xiao Wang et.al.* ICTIR 2021 (**ColBERT-PRF**)
- [Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback.](https://arxiv.org/pdf/2108.13454.pdf) *HongChien Yu et.al.* CIKM 2021. [[code](https://github.com/yuhongqian/ANCE-PRF)] (**ANCE-PRF**)
- [LoL: A Comparative Regularization Loss over Query Reformulation Losses for Pseudo-Relevance Feedback.](https://arxiv.org/pdf/2204.11545.pdf) *Yunchang Zhu et.al.* SIGIR 2022. [[code](https://github.com/zycdev/LoL)] (**LoL, Pseudo-relevance feedback**)
#### Bias
- [Implicit Feedback for Dense Passage Retrieval: A Counterfactual Approach.](https://arxiv.org/pdf/2204.00718.pdf) *Shengyao Zhuang et.al.* SIGIR 2022. [[code](https://github.com/ielab/Counterfactual-DR)] (**CoRocchio, Counterfactual Rocchio algorithm**)
- [Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models.](https://arxiv.org/pdf/2209.05072.pdf) *Yinqiong Cai et.al.* CIKM 2022.
### Hybrid Retrieval
- [Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index.](https://arxiv.org/pdf/1906.05807.pdf) *Minjoon Seo,Jinhyuk Lee et.al.* ACL 2019. [[code](https://github.com/uwnlp/denspi)] (**DENSPI**)
- [Complement Lexical Retrieval Model with Semantic Residual Embeddings.](https://arxiv.org/pdf/2004.13969.pdf) *Luyu Gao et.al.* ECIR 2021.
- [BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval.](https://dl.acm.org/doi/pdf/10.1145/3471158.3472233) *Shuai Wang et.al.* ICTIR 2021.
- [Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval.](https://arxiv.org/pdf/2201.05409.pdf) *Shitao Xiao et.al.* WWW 2022. [[code](https://github.com/microsoft/BiDR)]
## Re-ranking Stage
### Basic Usage
#### Discriminative ranking models
##### Representation-focused
- [Understanding the Behaviors of BERT in Ranking.](https://arxiv.org/pdf/1904.07531.pdf) *Yifan Qiao et.al.* Arxiv 2019. (**Representation-focused and Interaction-focused**)
##### Interaction-focused
- [Passage Re-ranking with BERT.](https://arxiv.org/pdf/1901.04085.pdf) *Rodrigo Nogueira et.al.* [[code](https://github.com/nyu-dl/dl4marco-bert)] (**monoBERT: Maybe the first work on applying BERT to IR**)
- [Multi-Stage Document Ranking with BERT,](https://arxiv.org/pdf/1910.14424.pdf) [The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models.](https://arxiv.org/pdf/2101.05667.pdf) *Rodrigo Nogueira et.al.* Arxiv 2020. (**Expando-Mono-Duo: doc2query+pointwise+pairwise**)
- [CEDR: Contextualized Embeddings for Document Ranking.](https://arxiv.org/pdf/1904.07094.pdf) *Sean MacAvaney et.al.* SIGIR 2019 short. [[code](https://github.com/Georgetown-IR-Lab/cedr)] (**CEDR: BERT+neuIR model**)
#### Generative ranking models
- [Beyond [CLS] through Ranking by Generation.](https://arxiv.org/pdf/2010.03073.pdf) *Cicero Nogueira dos Santos et.al.* EMNLP 2020 short. (**Query generation using GPT and BART**)
- [Document Ranking with a Pretrained Sequence-to-Sequence Model.](https://arxiv.org/pdf/2003.06713.pdf) *Rodrigo Nogueira, Zhiying Jiang et.al.* EMNLP 2020. [[code](https://github.com/castorini/pygaggle/)] (**Relevance token generation using T5**)
- [RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses.](https://arxiv.org/pdf/2210.10634.pdf) *Honglei Zhuang et.al.* Arxiv 2022.
#### Hybrid ranking models
- [Generalizing Discriminative Retrieval Models using Generative Tasks.](https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1414) *Bingsheng Liu, Hamed Zamani et.al.* WWW 2021. (**GDMTL, joint discriminative and generative model with multitask learning**)
### Long Document Processing Techniques
#### Passage score aggregation
- [Deeper Text Understanding for IR with Contextual Neural Language Modeling.](https://arxiv.org/pdf/1905.09217.pdf) *Zhuyun Dai et.al.* SIGIR 2019 short. [[code](https://github.com/AdeDZY/SIGIR19-BERT-IR)] (**BERT-MaxP, BERT-firstP, BERT-sumP: Passage-level**)
- [Simple Applications of BERT for Ad Hoc Document Retrieval,](https://arxiv.org/pdf/1903.10972.pdf) [Applying BERT to Document Retrieval with Birch,](https://www.aclweb.org/anthology/D19-3004.pdf) [Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval.](https://www.aclweb.org/anthology/D19-1352.pdf) *Wei Yang, Haotian Zhang et.al.* Arxiv 2020, *Zeynep Akkalyoncu Yilmaz et.al.* EMNLP 2019 short. [[code](https://github.com/castorini/birch)] (**Birch: Sentence-level**)
- [Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking.](https://arxiv.org/pdf/2105.09816.pdf) *Sebastian Hofstätter et.al.* SIGIR 2021. [[code](https://github.com/sebastian-hofstaetter/intra-document-cascade)] (**Distill a ranking model to conv-knrm to select top-k passages**)
#### Passage representation aggregation
- [PARADE: Passage Representation Aggregation for Document Reranking.](https://arxiv.org/pdf/2008.09093.pdf) *Canjia Li et.al.* Arxiv 2020. [[code](https://github.com/canjiali/PARADE/)] (**An extensive comparison of various Passage Representation Aggregation methods**)
- [Leveraging Passage-level Cumulative Gain for Document Ranking.](https://dl.acm.org/doi/pdf/10.1145/3366423.3380305) *Zhijing Wu et.al.* WWW 2020. (**PCGM**)
#### Designing new architectures
- [Local Self-Attention over Long Text for Efficient Document Retrieval.](https://arxiv.org/pdf/2005.04908.pdf) *Sebastian Hofstätter et.al.* SIGIR 2020 short. [[code](https://github.com/sebastian-hofstaetter/transformer-kernel-ranking)] (**TKL:Transformer-Kernel for long text**)
- [Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching.](https://arxiv.org/pdf/2004.12297v2.pdf) *Liu Yang et.al.* CIKM 2020. [[code](https://github.com/google-research/google-research/tree/master/smith)] (**SMITH for doc2doc matching**)
- [Socialformer: Social Network Inspired Long Document Modeling for Document Ranking.](https://arxiv.org/pdf/2202.10870.pdf) *Yujia Zhou et.al.* WWW 2022. (**Socialformer**)
### Improving Efficiency
#### Decoupling the interaction
- [DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding.](https://arxiv.org/pdf/2002.12591.pdf) *Yuyu Zhang, Ping Nie et.al.* SIGIR 2020 short. (**DC-BERT**)
- [Efficient Document Re-Ranking for Transformers by Precomputing Term Representations.](https://arxiv.org/pdf/2004.14255.pdf) *Sean MacAvaney et.al.* SIGIR 2020. [[code](https://github.com/Georgetown-IR-Lab/prettr-neural-ir)] (**PreTTR**)
- [Modularized Transfomer-based Ranking Framework.](https://arxiv.org/pdf/2004.13313.pdf) *Luyu Gao et.al.* EMNLP 2020. [[code](https://github.com/luyug/MORES)] (**MORES, similar to PreTTR**)
- [TILDE: Term Independent Likelihood moDEl for Passage Re-ranking.](https://dl.acm.org/doi/pdf/10.1145/3404835.3462922) *Shengyao Zhuang, Guido Zuccon* SIGIR 2021. [[code](https://github.com/ielab/TILDE)] (**TILDE**)
- [Fast Forward Indexes for Efficient Document Ranking.](https://arxiv.org/pdf/2110.06051.pdf) *Jurek Leonhardt et.al.* WWW 2022. (**Fast forward index**)
#### Knowledge distillation
- [Understanding BERT Rankers Under Distillation.](https://arxiv.org/pdf/2007.11088.pdf) *Luyu Gao et.al.* ICTIR 2020. (**LM Distill + Ranker Distill**)
- [Simplified TinyBERT: Knowledge Distillation for Document Retrieval.](https://arxiv.org/pdf/2009.07531.pdf) *Xuanang Chen et.al.* ECIR 2021. [[code](https://github.com/cxa-unique/Simplified-TinyBERT)] (**TinyBERT+knowledge distillation**)
#### Partial Fine-tuning
- [Semi-Siamese Bi-encoder Neural Ranking Model Using Lightweight Fine-Tuning.](https://arxiv.org/pdf/2110.14943.pdf) *Euna Jung, Jaekeol Choi et.al.* WWW 2022. [[code](https://github.com/xlpczv/Semi_Siamese)] (**Lightweight Fine-Tuning**)
- [Scattered or Connected? An Optimized Parameter-efficient Tuning Approach for Information Retrieval.](https://arxiv.org/pdf/2208.09847.pdf) *Xinyu Ma et.al.* CIKM 2022. (**IAA, introduce the aside module to stabilize training**)
#### Early exit
- [The Cascade Transformer: an Application for Efficient Answer Sentence Selection.](https://arxiv.org/pdf/2005.02534.pdf) *Luca Soldaini et.al.* ACL 2020.[[code](https://github.com/alexa/wqa-cascade-transformers)] (**Cascade Transformer: prune candidates by layer**)
- [Early Exiting BERT for Efficient Document Ranking.](https://www.aclweb.org/anthology/2020.sustainlp-1.11.pdf) *Ji Xin et.al.* EMNLP 2020 SustaiNLP Workshop. [[code](https://github.com/castorini/earlyexiting-monobert)] (**Early exit**)
### Other Topics
#### Query Expansion
- [BERT-QE: Contextualized Query Expansion for Document Re-ranking.](https://arxiv.org/pdf/2009.07258.pdf) *Zhi Zheng et.al.* EMNLP 2020 Findings. [[code](https://github.com/zh-zheng/BERT-QE)] (**BERT-QE**)
#### Re-weighting Training Samples
- [Training Curricula for Open Domain Answer Re-Ranking.](https://arxiv.org/pdf/2004.14269.pdf) *Sean MacAvaney et.al.* SIGIR 2020. [[code](https://github.com/Georgetown-IR-Lab/curricula-neural-ir)] (**curriculum learning based on BM25**)
- [Not All Relevance Scores are Equal: Efficient Uncertainty and Calibration Modeling for Deep Retrieval Models.](https://arxiv.org/pdf/2105.04651.pdf) *Daniel Cohen et.al.* SIGIR 2021.
#### Pre-training Tailored for Re-ranking
- [MarkedBERT: Integrating Traditional IR Cues in Pre-trained Language Models for Passage Retrieval.](https://dl.acm.org/doi/pdf/10.1145/3397271.3401194) *Lila Boualili et.al.* SIGIR 2020 short. [[code](https://github.com/BOUALILILila/markers_bert)] (**MarkedBERT**)
- [Selective Weak Supervision for Neural Information Retrieval.](https://arxiv.org/pdf/2001.10382.pdf) *Kaitao Zhang et.al.* WWW 2020. [[code](https://github.com/thunlp/ReInfoSelect)] (**ReInfoSelect**)
- [PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval.](https://arxiv.org/pdf/2010.10137.pdf) *Xinyu Ma et.al.* WSDM 2021. [[code](https://github.com/Albert-Ma/PROP)] (**PROP**)
- [Cross-lingual Language Model Pretraining for Retrieval.](https://dl.acm.org/doi/pdf/10.1145/3442381.3449830) *Puxuan Yu et.al.* WWW 2021.
- [B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval.](https://arxiv.org/pdf/2104.09791.pdf) *Xinyu Ma et.al.* SIGIR 2021. [[code](https://github.com/Albert-Ma/PROP)] (**B-PROP**)
- [Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need.](https://arxiv.org/pdf/2108.09346.pdf) *Zhengyi Ma et.al.* CIKM 2021. [[code](https://github.com/zhengyima/Anchors)] (**HARP**)
- [Contrastive Learning of User Behavior Sequence for Context-Aware Document Ranking.](https://arxiv.org/pdf/2108.10510.pdf) *Yutao Zhu et.al.* CIKM 2021. [[code](https://github.com/DaoD/COCA)](**COCA**)
- [Pre-trained Language Model based Ranking in Baidu Search.](https://arxiv.org/pdf/2105.11108.pdf) *Lixin Zou et.al.* KDD 2021.
- [A Unified Pretraining Framework for Passage Ranking and Expansion.](https://ojs.aaai.org/index.php/AAAI/article/view/16584) *Ming Yan et.al.* AAAI 2021. (**UED, jointly training ranking and query generation**)
- [Axiomatically Regularized Pre-training for Ad hoc Search.](https://xuanyuan14.github.io/files/SIGIR22Chen.pdf) *Jia Chen et.al.* SIGIR 2022. [[code](https://github.com/xuanyuan14/ARES)] (**ARES**)
- [Webformer: Pre-training with Web Pages for Information Retrieval.](https://dl.acm.org/doi/pdf/10.1145/3477495.3532086) *Yu Guo et.al.* SIGIR 2022. (**Webformer**)
#### Adversarial Attack and Defence
- [Competitive Search.](https://dl.acm.org/doi/pdf/10.1145/3477495.3532771) *Oren Kurland et.al.* SIGIR 2022.
- [PRADA: Practical Black-Box Adversarial Attacks against Neural Ranking Models.](https://arxiv.org/pdf/2204.01321) *Chen Wu et.al.* Arxiv 2022
- [Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models](https://arxiv.org/pdf/2209.06506.pdf) *Jiawei Liu et.al.* CCS 2022
- [Are Neural Ranking Models Robust?](https://arxiv.org/pdf/2108.05018.pdf) *Chen Wu et.al.* TOIS
- [Certified Robustness to Word Substitution Ranking Attack for Neural Ranking Models](https://arxiv.org/pdf/2209.06691.pdf) *Chen Wu et.al.* CIKM 2022
- [Topic-oriented Adversarial Attacks against Black-box Neural Ranking Models.](https://arxiv.org/pdf/2304.14867.pdf) *Yu-An Liu et.al.* SIGIR 2023.
#### Cross-lingual Retrieval
- [Cross-lingual Retrieval for Iterative Self-Supervised Training.](https://arxiv.org/pdf/2006.09526.pdf) *Chau Tran et.al.* NeurIPS 2020. [[code](https://github.com/pytorch/fairseq/tree/master/examples/criss)] (**CRISS**)
- [CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval.](https://www.aclweb.org/anthology/2020.emnlp-main.340.pdf) *Shuo Sun et.al.* EMNLP 2020. [[code](https://github.com/ssun32/CLIRMatrix)] (**Multilingual dataset-CLIRMatrix and multilingual BERT**)
## Jointly Learning Retrieval and Re-ranking
- [RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking.](https://arxiv.org/pdf/2110.07367.pdf) *Ruiyang Ren, Yingqi Qu et.al.* EMNLP 2021. [[code](https://github.com/PaddlePaddle/RocketQA)] (**RocketQAv2**)
- [Adversarial Retriever-Ranker for dense text retrieval.](https://arxiv.org/pdf/2110.03611.pdf) *Hang Zhang et.al.* ICLR 2022. [[code](https://github.com/microsoft/AR2)] (**AR2**)
- [RankFlow: Joint Optimization of Multi-Stage Cascade Ranking Systems as Flows.](https://dl.acm.org/doi/pdf/10.1145/3477495.3532050) *Jiarui Qin et.al.* SIGIR 2022. (**RankFlow**)

## Model-based IR System
- [Rethinking Search: Making Domain Experts out of Dilettantes.](https://arxiv.org/pdf/2105.02274.pdf) *Donald Metzler et.al.* SIGIR Forum 2020. (**Envisioned the model-based IR system**)
- [Transformer Memory as a Differentiable Search Index.](https://arxiv.org/pdf/2202.06991.pdf) *Yi Tay et.al.* Arxiv 2022. (**DSI**)
- [DynamicRetriever: A Pre-training Model-based IR System with Neither Sparse nor Dense Index.](https://arxiv.org/pdf/2203.00537.pdf) *Yujia Zhou et.al.* Arxiv 2022. (**DynamicRetriever**)
- [A Neural Corpus Indexer for Document Retrieval.](https://arxiv.org/pdf/2206.02743.pdf) *Yujing Wang et.al.* Arxiv 2022. (**NCI**)
- [Autoregressive Search Engines: Generating Substrings as Document Identifiers.](https://arxiv.org/pdf/2204.10628.pdf) *Michele Bevilacqua et.al.* Arxiv 2022. [[code](https://github.com/facebookresearch/SEAL)] (**SEAL**)
- [CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks.](https://arxiv.org/pdf/2208.07652.pdf) *Jiangui Chen et.al.* CIKM 2022. [[code](https://github.com/ict-bigdatalab/CorpusBrain)] (**CorpusBrain**)
- [A Unified Generative Retriever for Knowledge-Intensive Language Tasks via Prompt Learning.](https://arxiv.org/pdf/2304.14856.pdf) *Jiangui Chen et.al.* SIGIR 2023. [[code](https://github.com/ict-bigdatalab/UGR)] (**UGR**)
- [TOME: A Two-stage Approach for Model-based Retrieval.](https://arxiv.org/pdf/2305.11161.pdf) *Ruiyang Ren et.al.* ACL 2023. (**TOME: Passage generation then URL generation**)
- [How Does Generative Retrieval Scale to Millions of Passages?](https://arxiv.org/pdf/2305.11841.pdf) *Ronak Pradeep, Kai Hui et.al.* Arxiv 2023. (**Comprehensive study on proposed methods, using synthetic queries as document ids**)
- [Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies.](https://arxiv.org/pdf/2305.15115.pdf) *Yubao Tang et.al.* KDD 2023. (**Semantic-Enhanced DSI**)

## LLM and IR
### Perspectives or Surveys
- [Information Retrieval meets Large Language Models: A strategic report from Chinese IR community.](https://arxiv.org/pdf/2307.09751.pdf) *Qingyao AI et.al.* The CCIR community. AI Open 2023.
- [Large Language Models for Information Retrieval: A Survey.](https://arxiv.org/pdf/2308.07107.pdf) *Yutao Zhu et.al.* Renmin University of China. Arxiv 2023.
- [Navigating Complex Search Tasks with AI Copilots.](https://arxiv.org/pdf/2005.11401.pdf) *Ryen W. White* Microsoft Research. Arxiv 2023.

### Retrieval Augmented LLM
- [Retrieval-augmented generation for knowledge-intensive NLP tasks.](https://arxiv.org/pdf/2005.11401.pdf) *Patrick Lewis, Ethan Perez et.al.* NIPS 2020. (**RAG, for 440M BART**)
- [Improving Language Models by Retrieving from Trillions of Tokens.](https://arxiv.org/pdf/2112.04426.pdf) *Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann et.al.* ICML 2022. [[code](https://github.com/facebookresearch/FiD)] (**RETRO, enc-dec 7.5B**)
- [Atlas: Few-shot Learning with Retrieval Augmented Language Models.](https://arxiv.org/pdf/2208.03299.pdf) *Gautier Izacard, Patrick Lewis et.al.* Arxiv 2022. [[code](https://github.com/facebookresearch/atlas)] (**Atlas, T5, 11B**)
- [Internet-augmented language models through few-shot prompting for open-domain question answering.](https://arxiv.org/pdf/2203.05115.pdf) *Angeliki Lazaridou et.al.* Arxiv 2022. (**Gopher 280B, Conditioning on Google search results**)
- [Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy.](https://arxiv.org/pdf/2305.15294.pdf) *Zhihong Shao et.al.* Arxiv 2023.
- [Instruction Tuning post Retrieval-Augmented Pretraining.](https://arxiv.org/pdf/2310.07713.pdf) *Boxin Wang et.al.* Arxiv 2023.
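- [Retrieve Anything To Augment Large Language Models.](https://arxiv.org/pdf/2310.07554.pdf) *Peitian Zhang, Shitao Xiao et.al.* Arxiv 2023. [[code](https://github.com/FlagOpen/FlagEmbedding)]

### LLM for IR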
#### Synthetic Query Generation
- [Improving Passage Retrieval with Zero-Shot Question Generation.](https://arxiv.org/pdf/2204.07496.pdf) *Devendra Singh Sachan et.al.* EMNLP 2022. [[code](https://github.com/DevSinghSachan/unsupervised-passage-reranking)](**UPR, rerank docs based on query likelihood of GPT-neo 2.7B/T0 3B,11B**)
- [Promptagator: Few-shot Dense Retrieval From 8 Examples.](https://arxiv.org/pdf/2209.11755.pdf) *Zhuyun Dai et.al.* ICLR 2023. (**Generate pseudo queries using in-context learning, FLAN 137B**)
- [UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers.](https://arxiv.org/pdf/2303.00807.pdf) *Jon Saad-Falcon, Omar Khattab et.al.* Arxiv 2023. [[code](https://github.com/primeqa/primeqa)] (**Train reranker with pseudo queries generated by GPT3**)
- [InPars: Data Augmentation for Information Retrieval using Large Language Models.](https://arxiv.org/pdf/2202.05144.pdf) *Luiz Bonifacio et.al.* Arxiv 2022. [[code](https://github.com/zetaalphavector/InPars/tree/master/legacy/inpars-v1)] (**Use GPT-3 Curie to generate pseudo queries with in-context learning, query generation probs to select top-k q-d pairs**)
- [InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval.](https://arxiv.org/pdf/2301.01820.pdf) *Vitor Jeronymo et.al.* Arxiv 2023. [[code](https://github.com/zetaalphavector/inPars/tree/master/legacy/inpars-v2)] (**similar to InPars, use GPT-J 6B LLM, and a finetuned reranker as selector**)
- [InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers.](https://arxiv.org/pdf/2301.02998.pdf) *Leonid Boytsov et.al.* Arxiv 2023. (**similar to InPars, use GPT-J 6B and BLOOM 7B**)
- [Generative Relevance Feedback with Large Language Models.](https://arxiv.org/pdf/2304.13157.pdf) *Iain Mackie et.al.* SIGIR 2023 short. (**GRF, generate various info with GPT3 for relevance feedback**)
- [Query Expansion by Prompting Large Language Models.](https://arxiv.org/pdf/2305.03653.pdf) *Rolf Jagerman et.al.* Arxiv 2023.
- [Exploring the Viability of Synthetic Query Generation for Relevance Prediction.](https://arxiv.org/pdf/2305.11944.pdf) *Aditi Chaudhary et.al.* Arxiv 2023. (**FLAN-137B label conditioned generation**)
- [Large Language Model based Long-tail Query Rewriting in Taobao Search.](https://arxiv.org/pdf/2311.03758.pdf) *Wenjun Peng et.al.* Arxiv 2023.
- [Generate, Filter, and Fuse: Query Expansion via Multi-Step Keyword Generation for Zero-Shot Neural Rankers.](https://arxiv.org/pdf/2311.09175.pdf) *Minghan Li et.al.* Arxiv 2023. (**Use Flan-PaLM2-S for keywords generation**)
- [Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval.](https://arxiv.org/pdf/2311.05800.pdf) *Nandan Thakur et.al.* Arxiv 2023.

#### Synthetic Document Generation
- [Generate rather than Retrieve: Large Language Models are Strong Context Generators.](https://arxiv.org/pdf/2209.10063.pdf) *Wenhao Yu et.al.* ICLR 2023. [[code](https://github.com/wyu97/GenRead)] (**GenRead,generate pseudo doc with InstructGPT for reader**)
- [Recitation-Augmented Language Models.](https://arxiv.org/pdf/2210.01296.pdf) *Zhiqing Sun et.al.* ICLR 2023. [[code](https://github.com/Edward-Sun/RECITE)] (**similar to GenRead**)
- [Precise Zero-Shot Dense Retrieval without Relevance Labels.](https://arxiv.org/pdf/2212.10496.pdf) *Luyu Gao, Xueguang Ma et.al.* Arxiv 2022. [[code](https://github.com/texttron/hyde)] (**HyDE, InstructGPT generates a pseudo doc and Contriever retrieves the real one**)
- [Query2doc: Query Expansion with Large Language Models.](https://arxiv.org/pdf/2303.07678.pdf) *Liang Wang et.al.* Arxiv 2023. (**Generate pseudo docs using in-context learning and then concat with queries, text-davinci-003**)
- [Large Language Models are Strong Zero-Shot Retriever.](https://arxiv.org/pdf/2304.14233.pdf) *Tao Shen et.al.* Arxiv 2023. (**similar to HyDE, augment the LLM with retrieved docs using BM25**)
- [Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts.](https://arxiv.org/pdf/2305.02320.pdf) *Arian Askari et.al.* Arxiv 2023. [[code](https://github.com/arian-askari/ChatGPT-RetrievalQA)] (**Ranking with synthetic data generated by ChatGPT**)

#### LLM for Relevance Scoring
- [Task-aware Retrieval with Instructions.](https://arxiv.org/pdf/2211.09260.pdf) *Akari Asai, Timo Schick et.al.* Arxiv 2022. [[code](https://github.com/facebookresearch/tart)] (**TART, BERRI 40 tasks with instructions,1.5B FLAN-T5**)
- [One Embedder, Any Task: Instruction-Finetuned Text Embeddings.](https://arxiv.org/pdf/2212.09741.pdf) *Hongjin Su, Weijia Shi et.al.* [[code](https://github.com/HKUNLP/instructor-embedding)] (**Instructor, 330 diverse tasks, 1.5B model**)
- [ExaRanker: Explanation-Augmented Neural Ranker.](https://arxiv.org/pdf/2301.10521.pdf) *Fernando Ferraretto et.al.* Arxiv 2023. [[code](https://github.com/unicamp-dl/ExaRanker)] (**Training monoT5 with both relevance score and explanations generated by GPT-3.5 (text-davinci-002)**)
- [Perspectives on Large Language Models for Relevance Judgment.](https://arxiv.org/pdf/2304.09161.pdf) *Guglielmo Faggioli et.al.* Arxiv 2023. (**Perspective Paper**)
- [Zero-Shot Listwise Document Reranking with a Large Language Model.](https://arxiv.org/pdf/2305.02156.pdf) *Xueguang Ma et.al.* Arxiv 2023. (**LRL, generate rank list with GPT3**)
- [Large Language Models are Built-in Autoregressive Search Engines.](https://arxiv.org/pdf/2305.09612.pdf) *Noah Ziems et.al.* Arxiv 2023. (**LLM-URL, use GPT-3 text-davinci-003 to generate URL, model-based IR**)
- [Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent.](https://arxiv.org/pdf/2304.09542.pdf) *Weiwei Sun et.al.* EMNLP main 2023.[[code](https://github.com/sunnweiwei/RankGPT)](**Zero-shot Passage reranking with ChatGPT/GPT4**)
- [Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting.](https://arxiv.org/pdf/2306.17563.pdf) *Zhen Qin et.al.* Arxiv 2023.
- [RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models.](https://arxiv.org/pdf/2309.15088.pdf) *Ronak Pradeep et.al.* Arxiv 2023. [[code](https://github.com/castorini/rank_llm)]
- [Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models.](https://arxiv.org/pdf/2310.07712.pdf) *Raphael Tang, Xinyu Zhang et.al.* Arxiv 2023. [[code](https://github.com/castorini/perm-sc)]
- [Fine-Tuning LLaMA for Multi-Stage Text Retrieval.](https://arxiv.org/pdf/2310.08319.pdf) *Xueguang Ma et.al.* Arxiv 2023.
- [A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models.](https://arxiv.org/pdf/2310.09497.pdf) *Shengyao Zhuang et.al.* Arxiv 2023.
- [Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking.](https://arxiv.org/pdf/2310.13243.pdf) *Shengyao Zhuang et.al.* Arxiv 2023. [[code](https://github.com/ielab/llm-qlm)]
- [PaRaDe: Passage Ranking using Demonstrations with Large Language Models.](https://arxiv.org/pdf/2310.14408.pdf) *Andrew Drozdov et.al.* Arxiv 2023.
- [Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels.](https://arxiv.org/pdf/2310.14122.pdf) *Honglei Zhuang et.al.* Arxiv 2023.
- [Large Language Models can Accurately Predict Searcher Preferences.](https://arxiv.org/pdf/2309.10621.pdf) *Paul Thomas et.al.* Arxiv 2023.
- [RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!](https://arxiv.org/pdf/2312.02724.pdf) *Ronak Pradeep et.al.* Arxiv 2023.
- [Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models.](https://arxiv.org/pdf/2312.02969.pdf) *Xinyu Zhang et.al.* Arxiv 2023.

#### LLM for Generative Retrieval
- [ACID: Abstractive, Content-Based IDs for Document Retrieval with Language Models.](https://arxiv.org/pdf/2311.08593.pdf) *Haoxin Li et.al.* Arxiv 2023. (**Using GPT-3.5 to generate keyphrases**)

#### Retrieval-Augmented Text Generation
- [WebGPT: Browser-assisted question-answering with human feedback.](https://arxiv.org/pdf/2112.09332.pdf) *Reiichiro Nakano,Jacob Hilton,Suchir Balaji et.al.* Arxiv 2022. (**WebGPT, GPT3**)
- [Teaching language models to support answers with verified quotes.](https://arxiv.org/pdf/2203.11147.pdf) *DeepMind* Arxiv 2022.
- [Evaluating Verifiability in Generative Search Engines.](https://arxiv.org/pdf/2304.09848.pdf) *Nelson F. Liu et.al.* Arxiv 2023. [[code](https://github.com/nelson-liu/evaluating-verifiability-in-generative-search-engines)]
- [Enabling Large Language Models to Generate Text with Citations.](https://arxiv.org/pdf/2305.14627.pdf) *Tianyu Gao et.al.* Arxiv 2023. [[code](https://github.com/princeton-nlp/ALCE)] (**ALCE benchmark**)
- [FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation.](https://arxiv.org/pdf/2310.03214.pdf) *Tu Vu et.al.* Arxiv 2023. [[code](https://github.com/freshllms/freshqa)]
- [Retrieve Anything To Augment Large Language Models.](https://arxiv.org/pdf/2310.07554.pdf) *Peitian Zhang, Shitao Xiao et.al.* Arxiv 2023. [[code](https://github.com/FlagOpen/FlagEmbedding)]
- [Leveraging Event Schema to Ask Clarifying Questions for Conversational Legal Case Retrieval.](https://dl.acm.org/doi/pdf/10.1145/3583780.3614953) *Bulou Liu et.al.* CIKM 2023.
- [Know Where to Go: Make LLM a Relevant, Responsible, and Trustworthy Searcher.](https://arxiv.org/pdf/2310.12443.pdf) *Xiang Shi et.al.*
- [Evaluating Generative Ad Hoc Information Retrieval.](https://arxiv.org/pdf/2311.04694.pdf) *Lukas Gienapp et.al.* Arxiv 2023.

#### Others
- [Demonstrate–Search–Predict: Composing retrieval and language models for knowledge-intensive NLP.](https://arxiv.org/pdf/2212.14024.pdf) *Omar Khattab et.al.* Arxiv 2023. [[code](https://github.com/stanfordnlp/dsp)] (**DSP program, GPT3.5**)

## Multimodal Retrieval
### Unified Single-stream Architecture
- [Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training.](https://arxiv.org/pdf/1908.06066.pdf) *Gen Li, Nan Duan et.al.* AAAI 2020. [[code](https://github.com/microsoft/Unicoder)] (**Unicoder-VL**)
- [XGPT: Cross-modal Generative Pre-Training for Image Captioning.](https://arxiv.org/pdf/2003.01473.pdf) *Qiaolin Xia, Haoyang Huang, Nan Duan et.al.* Arxiv 2020. [[code](https://github.com/microsoft/Unicoder)] (**XGPT**)
- [UNITER: UNiversal Image-TExt Representation Learning.](https://arxiv.org/pdf/1909.11740.pdf) *Yen-Chun Chen, Linjie Li et.al.* ECCV 2020. [[code](https://github.com/ChenRocks/UNITER)] (**UNITER**)
- [Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks.](https://arxiv.org/pdf/2004.06165.pdf) *Xiujun Li, Xi Yin et.al.* ECCV 2020. [[code](https://github.com/microsoft/Oscar)] (**Oscar**)
- [VinVL: Making Visual Representations Matter in Vision-Language Models.](https://arxiv.org/pdf/2101.00529.pdf) *Pengchuan Zhang, Xiujun Li et.al.* CVPR 2021. [[code](https://github.com/microsoft/Oscar)] (**VinVL**)
- [Dynamic Modality Interaction Modeling for Image-Text Retrieval.](https://dl.acm.org/doi/pdf/10.1145/3404835.3462829) *Leigang Qu et.al.* SIGIR 2021 **Best student paper**. [[code](https://sigir21.wixsite.com/dime)] (**DIME**)

### Multi-stream Architecture Applied on Input
- [ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.](https://arxiv.org/pdf/1908.02265.pdf) *Jiasen Lu, Dhruv Batra et.al.* NeurIPS 2019. [[code](https://github.com/facebookresearch/vilbert-multi-task)] (**VilBERT**)
- [12-in-1: Multi-Task Vision and Language Representation Learning.](https://arxiv.org/pdf/1912.02315.pdf) *Jiasen Lu, Dhruv Batra et.al.* CVPR 2020. [[code](https://github.com/facebookresearch/vilbert-multi-task)] (**A multi-task model based on VilBERT**)
- [Learning Transferable Visual Models From Natural Language Supervision.](https://arxiv.org/pdf/2103.00020.pdf) *Alec Radford et.al.* ICML 2021. [[code](https://github.com/OpenAI/CLIP)] (**CLIP, GPT team**)
- [ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph.](https://arxiv.org/pdf/2006.16934.pdf) *Fei Yu, Jiji Tang et.al.* Arxiv 2020. [[code](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil)] (**ERNIE-ViL,1st place on the VCR leaderboard**)
- [M6-v0: Vision-and-Language Interaction for Multi-modal Pretraining.](https://arxiv.org/pdf/2003.13198.pdf) *Junyang Lin, An Yang et.al.* KDD 2020. (**M6-v0/InterBERT**)
- [M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training.](https://arxiv.org/pdf/2006.02635.pdf) *Haoyang Huang, Lin Su et.al.* CVPR 2021. [[code](https://github.com/microsoft/M3P)] (**M3P, MILD dataset**)

## Other Resources
### Some Retrieval Toolkits
- [Faiss: a library for efficient similarity search and clustering of dense vectors](https://github.com/facebookresearch/faiss)
- [Pyserini: a Python Toolkit to Support Sparse and Dense Representations](https://github.com/castorini/pyserini/)
- [MatchZoo: a library consisting of many popular neural text matching models](https://github.com/NTMC-Community/MatchZoo)

### Other Resources About Pre-trained Models in NLP
- [Pre-trained Models for Natural Language Processing: A Survey.](https://arxiv.org/abs/2003.08271) *Xipeng Qiu et.al.*
- [BERT-related-papers](https://github.com/tomohideshibata/BERT-related-papers)
- [Pre-trained Language Model Papers from THU-NLP](https://github.com/thunlp/PLMpapers)

### Surveys About Efficient Transformers
- [Efficient Transformers: A Survey.](https://arxiv.org/pdf/2009.06732.pdf) *Yi Tay, Mostafa Dehghani et.al.* Arxiv 2020.