Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/Agrover112/awesome-semantic-search

A curated list of awesome resources related to Semantic Search🔎 and Semantic Similarity tasks.

List: awesome-semantic-search

awesome awesome-list hacktoberfest information-retrieval information-retrival nlp ranking semantic-search semantic-similarity sentence-embeddings


# Awesome Semantic-Search [![Awesome](https://awesome.re/badge.svg)](https://awesome.re) [![Conventional Commits](https://img.shields.io/badge/Conventional%20Commits-1.0.0-yellow.svg)](https://conventionalcommits.org)

Logo made by [@createdbytango](https://instagram.com/createdbytango).

**Looking for More Paper Additions.
PS: Raise a PR**

This repository aims to serve as a meta-repository for [Semantic Search](https://en.wikipedia.org/wiki/Semantic_search) and [Semantic Similarity](http://nlpprogress.com/english/semantic_textual_similarity.html) related tasks.

Semantic Search isn't limited to text! It can be done with images, speech, etc. There are numerous use-cases and applications of semantic search.
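
One way to see why this carries across modalities: once items are mapped to vectors by some embedding model, search reduces to ranking by vector similarity. The sketch below is a minimal illustration only, not a method taken from any of the listed papers. The three-dimensional vectors and item names are made-up placeholders standing in for real model embeddings, and `cosine_similarity` and `semantic_search` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, corpus, top_k=2):
    """Return the top_k corpus items ranked by cosine similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), item) for item, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]


# Assumption: toy, hand-written "embeddings". A real system would obtain
# these from a text, image, or speech encoder.
corpus = {
    "photo of a cat": [0.9, 0.1, 0.0],
    "audio clip of a dog barking": [0.1, 0.9, 0.1],
    "article about stock markets": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # placeholder embedding of a query such as "kitten"

results = semantic_search(query, corpus, top_k=2)
print(results)
```

The exhaustive scan here is only practical for small collections; the approximate nearest neighbor papers and libraries listed further down address the same ranking problem at scale.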

Feel free to raise a PR on this repo!

## Contents

- [Papers](#papers)
  - [2010](#2010)
  - [2014](#2014)
  - [2015](#2015)
  - [2016](#2016)
  - [2017](#2017)
  - [2018](#2018)
  - [2019](#2019)
  - [2020](#2020)
  - [2021](#2021)
  - [2022](#2022)
  - [2023](#2023)
- [Articles](#articles)
- [Libraries and Tools](#libraries-and-tools)
- [Datasets](#datasets)
- [Milestones](#milestones)

## Papers

### 2010
- [Priority Range Trees](https://arxiv.org/abs/1009.3527)
- [Information Retrieval and the semantic web](https://ieeexplore.ieee.org/document/5607549) 📄

### 2014
- [A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2014_cdssm_final.pdf) 📄

### 2015
- [Skip-Thought Vectors](https://arxiv.org/pdf/1506.06726.pdf) 📄
- [Practical and Optimal LSH for Angular Distance](https://proceedings.neurips.cc/paper/2015/hash/2823f4797102ce1a1aec05359cc16dd9-Abstract.html)

### 2016
- [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759) 📄
- [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606) 📄
- [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/abs/1603.09320)
- [On Approximately Searching for Similar Word Embeddings](https://www.aclweb.org/anthology/P16-1214.pdf)
- [Learning Distributed Representations of Sentences from Unlabelled Data](https://arxiv.org/abs/1602.03483)📄
- [Approximate Nearest Neighbor Search on High Dimensional Data --- Experiments, Analyses, and Improvement](https://arxiv.org/abs/1610.02455)

### 2017
- [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://research.fb.com/wp-content/uploads/2017/09/emnlp2017.pdf) 📄
- [Semantic Textual Similarity For Hindi](https://www.semanticscholar.org/paper/Semantic-Textual-Similarity-For-Hindi-Mujadia-Mamidi/372f615ce36d7543512b8e40d6de51d17f316e0b)📄
- [Efficient Natural Language Response Suggestion for Smart Reply](https://arxiv.org/abs/1705.00652)📃

### 2018
- [Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf) 📄
- [Learning Semantic Textual Similarity from Conversations](https://arxiv.org/pdf/1804.07754.pdf) 📄
- [Google AI Blog: Advances in Semantic Textual Similarity](https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html) 📄
- [Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech](https://arxiv.org/abs/1803.08976)🔊
- [Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data](https://arxiv.org/abs/1810.07355) 🔊
- [Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph](http://www.vldb.org/pvldb/vol12/p461-fu.pdf)
- [The Case for Learned Index Structures](https://dl.acm.org/doi/10.1145/3183713.3196909)

### 2019
- [LASER: Language Agnostic Sentence Representations](https://engineering.fb.com/2019/01/22/ai-research/laser-multilingual-sentence-embeddings/) 📄
- [Document Expansion by Query Prediction](https://arxiv.org/abs/1904.08375) 📄
- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/pdf/1908.10084.pdf) 📄
- [Multi-Stage Document Ranking with BERT](https://arxiv.org/abs/1910.14424) 📄
- [Latent Retrieval for Weakly Supervised Open Domain Question Answering](https://arxiv.org/abs/1906.00300)
- [End-to-End Open-Domain Question Answering with BERTserini](https://www.aclweb.org/anthology/N19-4013/)
- [BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)📄
- [Analyzing and Improving Representations with the Soft Nearest Neighbor Loss](https://arxiv.org/pdf/1902.01889.pdf)📷
- [DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf)

### 2020
- [Rapidly Deploying a Neural Search Engine for the COVID-19 Open Research Dataset: Preliminary Thoughts and Lessons Learned](https://arxiv.org/abs/2004.05125) 📄
- [Passage Re-ranking with BERT](https://arxiv.org/pdf/1901.04085.pdf) 📄
- [CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization](https://arxiv.org/pdf/2006.09595.pdf) 📄
- [LaBSE: Language-agnostic BERT Sentence Embedding](https://arxiv.org/abs/2007.01852) 📄
- [Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset](https://arxiv.org/abs/2007.07846) 📄
- [DeText: A deep NLP framework for intelligent text understanding](https://engineering.linkedin.com/blog/2020/open-sourcing-detext) 📄
- [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/pdf/2004.09813.pdf) 📄
- [Pretrained Transformers for Text Ranking: BERT and Beyond](https://arxiv.org/abs/2010.06467) 📄
- [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909)
- [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB)📄
- [Improving Deep Learning For Airbnb Search](https://arxiv.org/pdf/2002.05515)
- [Managing Diversity in Airbnb Search](https://arxiv.org/abs/2004.02621)📄
- [Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval](https://arxiv.org/abs/2007.00808v1)📄
- [Unsupervised Image Style Embeddings for Retrieval and Recognition Tasks](https://openaccess.thecvf.com/content_WACV_2020/papers/Gairola_Unsupervised_Image_Style_Embeddings_for_Retrieval_and_Recognition_Tasks_WACV_2020_paper.pdf)📷
- [DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations](https://arxiv.org/abs/2006.03659)📄

### 2021
- [Hybrid approach for semantic similarity calculation between Tamil words](https://www.researchgate.net/publication/350112163_Hybrid_approach_for_semantic_similarity_calculation_between_Tamil_words) 📄
- [Augmented SBERT](https://arxiv.org/pdf/2010.08240.pdf) 📄
- [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) 📄
- [Compatibility-aware Heterogeneous Visual Search](https://arxiv.org/abs/2105.06047) 📷
- [Learning Personal Style from Few Examples](https://chuanenlin.com/personalstyle)📷
- [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979)📄
- [A Survey of Transformers](https://arxiv.org/abs/2106.04554)📄📷
- [SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://dl.acm.org/doi/10.1145/3404835.3463098)📄
- [High Quality Related Search Query Suggestions using Deep Reinforcement Learning](https://arxiv.org/abs/2108.04452v1)
- [Embedding-based Product Retrieval in Taobao Search](https://arxiv.org/pdf/2106.09297.pdf)📄📷
- [TPRM: A Topic-based Personalized Ranking Model for Web Search](https://arxiv.org/abs/2108.06014)📄
- [mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset](https://arxiv.org/abs/2108.13897)📄
- [Database Reasoning Over Text](https://aclanthology.org/2021.acl-long.241.pdf)📄
- [How Does Adversarial Fine-Tuning Benefit BERT?](https://arxiv.org/abs/2108.13602)📄
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)📄
- [Primer: Searching for Efficient Transformers for Language Modeling](https://arxiv.org/abs/2109.08668)📄
- [How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings](https://arxiv.org/pdf/2109.10179.pdf)🔊
- [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821#)📄
- [Compositional Attention: Disentangling Search and Retrieval](https://arxiv.org/abs/2110.09419)📄📷
- [SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search](https://arxiv.org/abs/2111.08566)
- [GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval](https://arxiv.org/abs/2112.07577) 📄
- [Generative Search Engines: Initial Experiments](https://computationalcreativity.net/iccc21/wp-content/uploads/2021/09/ICCC_2021_paper_50.pdf) 📷
- [Rethinking Search: Making Domain Experts out of Dilettantes](https://dl.acm.org/doi/10.1145/3476415.3476428)
- [WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach](https://arxiv.org/abs/2104.01767)

### 2022
- [Text and Code Embeddings by Contrastive Pre-Training](https://arxiv.org/abs/2201.10005)📄
- [RELIC: Retrieving Evidence for Literary Claims](https://arxiv.org/abs/2203.10053)📄
- [Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations](https://arxiv.org/abs/2109.13059)📄
- [SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation](https://arxiv.org/abs/2205.08180)🔊
- [An Analysis of Fusion Functions for Hybrid Retrieval](https://arxiv.org/abs/2210.11934)📄
- [Out-of-distribution Detection with Deep Nearest Neighbors](https://arxiv.org/abs/2204.06507)
- [ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition](https://arxiv.org/abs/2210.13352)🔊
- [Analyzing Acoustic Word Embeddings From Pre-Trained Self-Supervised Speech Models](https://arxiv.org/pdf/2210.16043.pdf)🔊
- [Rethinking with Retrieval: Faithful Large Language Model Inference](https://arxiv.org/abs/2301.00303)📄
- [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/pdf/2212.10496.pdf)📄
- [Transformer Memory as a Differentiable Search Index](https://arxiv.org/abs/2202.06991)📄

### 2023
- [FINGER: Fast Inference for Graph-based Approximate Nearest Neighbor Search](https://dl.acm.org/doi/10.1145/3543507.3583318)📄
- [“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors](https://aclanthology.org/2023.findings-acl.426/)📄
- [SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval](https://dl.acm.org/doi/pdf/10.1145/3539618.3592065) 📄

## Articles
- [Tackling Semantic Search](https://adityamalte.substack.com/p/tackle-semantic-search/)
- [Semantic search in Azure Cognitive Search](https://docs.microsoft.com/en-us/azure/search/semantic-search-overview)
- [How we used semantic search to make our search 10x smarter](https://zilliz.com/blog/How-we-used-semantic-search-to-make-our-search-10-x-smarter/)
- [Stanford AI Blog: Building Scalable, Explainable, and Adaptive NLP Models with Retrieval](https://ai.stanford.edu/blog/retrieval-based-NLP/)
- [Building a semantic search engine with dual space word embeddings](https://m.mage.ai/building-a-semantic-search-engine-with-dual-space-word-embeddings-f5a596eb6d90)
- [Billion-scale semantic similarity search with FAISS+SBERT](https://towardsdatascience.com/billion-scale-semantic-similarity-search-with-faiss-sbert-c845614962e2)
- [Some observations about similarity search thresholds](https://greglandrum.github.io/rdkit-blog/similarity/reference/2021/05/26/similarity-threshold-observations1.html)
- [Near Duplicate Image Search using Locality Sensitive Hashing](https://keras.io/examples/vision/near_dup_search/)
- [Free Course on Vector Similarity Search and Faiss](https://link.medium.com/HtFoFKlKvkb)
- [Comprehensive Guide To Approximate Nearest Neighbors Algorithms](https://link.medium.com/V62Z8drvEkb)
- [Introducing the hybrid index to enable keyword-aware semantic search](https://www.pinecone.io/learn/hybrid-search/?utm_medium=email&_hsmi=0&_hsenc=p2ANqtz--zLu9hiyh-y_XTa7FCEpi8JESJKmif5dhpYtAxTWka8PIttaTOGE21LMZlg9EOZyPYpCm6GDvYy57tlGRwH6TjgLCsJg&utm_content=231741722&utm_source=hs_email)
- [Argilla Semantic Search](https://docs.argilla.io/en/latest/guides/features/semantic-search.html)
- [Co:here's Multilingual Text Understanding Model](https://txt.cohere.ai/multilingual/)
- [Simplify Search with Multilingual Embedding Models](https://blog.vespa.ai/simplify-search-with-multilingual-embeddings/)

## Libraries and Tools
- [fastText](https://fasttext.cc/)
- [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4)
- [SBERT](https://www.sbert.net/)
- [ELECTRA](https://github.com/google-research/electra)
- [LaBSE](https://tfhub.dev/google/LaBSE/2)
- [LASER](https://github.com/facebookresearch/LASER)
- [Relevance AI - Vector Platform From Experimentation To Deployment](https://relevance.ai)
- [Haystack](https://github.com/deepset-ai/haystack/)
- [Jina.AI](https://jina.ai/)
- [pinecone](https://www.pinecone.io/)
- [SentEval Toolkit](https://github.com/facebookresearch/SentEval?utm_source=catalyzex.com)
- [ranx](https://github.com/AmenRa/ranx)
- [BEIR: Benchmarking IR](https://github.com/UKPLab/beir)
- [RELiC: Retrieving Evidence for Literary Claims Dataset](https://relic.cs.umass.edu/)
- [matchzoo-py](https://github.com/NTMC-Community/MatchZoo-py)
- [deep_text_matching](https://github.com/wangle1218/deep_text_matching)
- [Which Frame?](http://whichframe.com/)
- [lexica.art](https://lexica.art/)
- [emoji semantic search](https://github.com/lilianweng/emoji-semantic-search)
- [PySerini](https://github.com/castorini/pyserini)
- [BERTSerini](https://github.com/rsvp-ai/bertserini)
- [BERTSimilarity](https://github.com/Brokenwind/BertSimilarity)
- [milvus](https://www.milvus.io/)
- [NeuroNLP++](https://plusplus.neuronlp.fruitflybrain.org/)
- [weaviate](https://github.com/semi-technologies/weaviate)
- [semantic-search-through-wikipedia-with-weaviate](https://github.com/semi-technologies/semantic-search-through-wikipedia-with-weaviate)
- [natural-language-youtube-search](https://github.com/haltakov/natural-language-youtube-search)
- [same.energy](https://www.same.energy/about)
- [ann benchmarks](http://ann-benchmarks.com/)
- [scaNN](https://github.com/google-research/google-research/tree/master/scann)
- [REALM](https://github.com/google-research/language/tree/master/language/realm)
- [annoy](https://github.com/spotify/annoy)
- [pynndescent](https://github.com/lmcinnes/pynndescent)
- [nsg](https://github.com/ZJULearning/nsg)
- [FALCONN](https://github.com/FALCONN-LIB/FALCONN)
- [redis HNSW](https://github.com/zhao-lang/redis_hnsw)
- [autofaiss](https://github.com/criteo/autofaiss)
- [DPR](https://github.com/facebookresearch/DPR)
- [rank_BM25](https://github.com/dorianbrown/rank_bm25)
- [FlashRank](https://github.com/PrithivirajDamodaran/FlashRank)
- [nearPy](http://pixelogik.github.io/NearPy/)
- [vearch](https://github.com/vearch/vearch)
- [vespa](https://github.com/vespa-engine/vespa)
- [pgANN](https://github.com/netrasys/pgANN)
- [Tensorflow Similarity](https://github.com/tensorflow/similarity)
- [opensemanticsearch.org](https://www.opensemanticsearch.org/)
- [GPT3 Semantic Search](https://gpt3demo.com/category/semantic-search)
- [searchy](https://github.com/lubianat/searchy)
- [txtai](https://github.com/neuml/txtai)
- [HyperTag](https://github.com/Ravn-Tech/HyperTag)
- [vectorai](https://github.com/vector-ai/vectorai)
- [embeddinghub](https://github.com/featureform/embeddinghub)
- [AquilaDb](https://github.com/Aquila-Network/AquilaDB)
- [STripNet](https://github.com/stephenleo/stripnet)

## Datasets
- [Semantic Text Similarity Dataset Hub](https://github.com/brmson/dataset-sts)
- [Facebook AI Image Similarity Challenge](https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/?fbclid=IwAR31vRV0EdxRdrxtPy12neZtBJQ0H9qdLHm8Wl2DjHY09PtQdn1nEEIJVUo)
- [WIT: Wikipedia-based Image Text Dataset](https://github.com/google-research-datasets/wit)
- [BEIR](https://github.com/beir-cellar/beir)
- MTEB

## Milestones

Have a look at the [project board](https://github.com/Agrover112/awesome-semantic-search/projects/1) for the task list to contribute to any of the open issues.