Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
https://github.com/Agrover112/awesome-semantic-search

A curated list of awesome resources related to Semantic Search🔎 and Semantic Similarity tasks.
List: awesome-semantic-search
awesome awesome-list hacktoberfest information-retrieval information-retrival nlp ranking semantic-search semantic-similarity sentence-embeddings
Last synced: about 2 months ago
Host: GitHub
URL: https://github.com/Agrover112/awesome-semantic-search
Owner: Agrover112
License: cc0-1.0
Created: 2021-03-29T18:27:01.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2023-12-06T14:48:59.000Z (7 months ago)
Last Synced: 2024-05-14T04:01:35.715Z (about 2 months ago)
Topics: awesome, awesome-list, hacktoberfest, information-retrieval, information-retrival, nlp, ranking, semantic-search, semantic-similarity, sentence-embeddings
Homepage:
Size: 371 KB
Stars: 322
Watchers: 9
Forks: 27
Open Issues: 6
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md
Lists

awesome-stars - Agrover112/awesome-semantic-search - A curated list of awesome resources related to Semantic Search🔎 and Semantic Similarity tasks. (Others)
ultimate-awesome - awesome-semantic-search - A curated list of awesome resources related to Semantic Search🔎 and Semantic Similarity tasks. (Other Lists / Julia Lists)
README

# Awesome Semantic-Search [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)  [![Conventional Commits](https://img.shields.io/badge/Conventional%20Commits-1.0.0-yellow.svg)](https://conventionalcommits.org)



Logo made by [@createdbytango](https://instagram.com/createdbytango). 

**Looking for more paper additions. PS: Raise a PR.**

This repository aims to serve as a meta-repository for [Semantic Search](https://en.wikipedia.org/wiki/Semantic_search) and [Semantic Similarity](http://nlpprogress.com/english/semantic_textual_similarity.html) related tasks.

Semantic Search isn't limited to text! It can be done with images, speech, etc. There are numerous use cases and applications of semantic search.
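
As a rough illustration of the idea, items and queries can be represented as vectors and ranked by how close they are. The snippet below is only a minimal sketch under stated assumptions: the three-dimensional vectors are made up for the example (real embeddings would come from an encoder model such as those listed under Libraries and Tools), and the helper names `cosine_similarity` and `rank` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, corpus):
    """Return (item, score) pairs sorted by descending similarity to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; a real system would compute these with a model.
corpus = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "stock market report": [0.0, 0.2, 0.9],
    "kitten playing with yarn": [0.8, 0.3, 0.1],
}
query = [0.85, 0.2, 0.05]  # assumed embedding of a query such as "feline pictures"

results = rank(query, corpus)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

The same ranking step applies regardless of modality: once text, images, or speech are mapped into a shared vector space, only the encoder changes. At larger scale, the exhaustive comparison here would typically be replaced by an approximate nearest neighbor index.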

Feel free to raise a PR on this repo!

## Contents

- [Papers](#papers)

    - [2010](#2010)

    - [2014](#2014)

    - [2015](#2015)

    - [2016](#2016)

    - [2017](#2017)

    - [2018](#2018)

    - [2019](#2019)

    - [2020](#2020)

    - [2021](#2021)

    - [2022](#2022)

    - [2023](#2023)

- [Articles](#articles)

- [Libraries and Tools](#libraries-and-tools)

- [Datasets](#datasets)

- [Milestones](#milestones)

## Papers

### 2010

- [Priority Range Trees](https://arxiv.org/abs/1009.3527)

- [Information Retrieval and the semantic web](https://ieeexplore.ieee.org/document/5607549) 📄

### 2014 

- [A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2014_cdssm_final.pdf) 📄

### 2015

- [Skip-Thought Vectors](https://arxiv.org/pdf/1506.06726.pdf) 📄

- [Practical and Optimal LSH for Angular Distance](https://proceedings.neurips.cc/paper/2015/hash/2823f4797102ce1a1aec05359cc16dd9-Abstract.html)

### 2016

- [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759) 📄

- [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606) 📄

- [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/abs/1603.09320)

- [On Approximately Searching for Similar Word Embeddings](https://www.aclweb.org/anthology/P16-1214.pdf) 

- [Learning Distributed Representations of Sentences from Unlabelled Data](https://arxiv.org/abs/1602.03483)📄

- [Approximate Nearest Neighbor Search on High Dimensional Data --- Experiments, Analyses, and Improvement](https://arxiv.org/abs/1610.02455)

### 2017

- [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://research.fb.com/wp-content/uploads/2017/09/emnlp2017.pdf) 📄

- [Semantic Textual Similarity For Hindi](https://www.semanticscholar.org/paper/Semantic-Textual-Similarity-For-Hindi-Mujadia-Mamidi/372f615ce36d7543512b8e40d6de51d17f316e0b)📄

- [Efficient Natural Language Response Suggestion for Smart Reply](https://arxiv.org/abs/1705.00652)📃

### 2018

- [Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf) 📄

- [Learning Semantic Textual Similarity from Conversations](https://arxiv.org/pdf/1804.07754.pdf) 📄

- [Google AI Blog: Advances in Semantic Textual Similarity](https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html) 📄

- [Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech](https://arxiv.org/abs/1803.08976)🔊

- [Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data](https://arxiv.org/abs/1810.07355) 🔊

- [Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph](http://www.vldb.org/pvldb/vol12/p461-fu.pdf)

- [The Case for Learned Index Structures](https://dl.acm.org/doi/10.1145/3183713.3196909)

### 2019

- [LASER: Language Agnostic Sentence Representations](https://engineering.fb.com/2019/01/22/ai-research/laser-multilingual-sentence-embeddings/) 📄

- [Document Expansion by Query Prediction](https://arxiv.org/abs/1904.08375) 📄

- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/pdf/1908.10084.pdf) 📄

- [Multi-Stage Document Ranking with BERT](https://arxiv.org/abs/1910.14424) 📄

- [Latent Retrieval for Weakly Supervised Open Domain Question Answering](https://arxiv.org/abs/1906.00300)

- [End-to-End Open-Domain Question Answering with BERTserini](https://www.aclweb.org/anthology/N19-4013/)

- [BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)📄

- [Analyzing and Improving Representations with the Soft Nearest Neighbor Loss](https://arxiv.org/pdf/1902.01889.pdf)📷

- [DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf)

### 2020

- [Rapidly Deploying a Neural Search Engine for the COVID-19 Open Research Dataset: Preliminary Thoughts and Lessons Learned](https://arxiv.org/abs/2004.05125) 📄

- [Passage Re-ranking with BERT](https://arxiv.org/pdf/1901.04085.pdf) 📄

- [CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization](https://arxiv.org/pdf/2006.09595.pdf) 📄

- [LaBSE: Language-agnostic BERT Sentence Embedding](https://arxiv.org/abs/2007.01852) 📄

- [Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset](https://arxiv.org/abs/2007.07846) 📄

- [DeText: A deep NLP framework for intelligent text understanding](https://engineering.linkedin.com/blog/2020/open-sourcing-detext) 📄

- [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/pdf/2004.09813.pdf) 📄

- [Pretrained Transformers for Text Ranking: BERT and Beyond](https://arxiv.org/abs/2010.06467) 📄

- [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909)

- [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB)📄

- [Improving Deep Learning For Airbnb Search](https://arxiv.org/pdf/2002.05515)

- [Managing Diversity in Airbnb Search](https://arxiv.org/abs/2004.02621)📄

- [Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval](https://arxiv.org/abs/2007.00808v1)📄

- [Unsupervised Image Style Embeddings for Retrieval and Recognition Tasks](https://openaccess.thecvf.com/content_WACV_2020/papers/Gairola_Unsupervised_Image_Style_Embeddings_for_Retrieval_and_Recognition_Tasks_WACV_2020_paper.pdf)📷

- [DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations](https://arxiv.org/abs/2006.03659)📄

### 2021

- [Hybrid approach for semantic similarity calculation between Tamil words](https://www.researchgate.net/publication/350112163_Hybrid_approach_for_semantic_similarity_calculation_between_Tamil_words) 📄

- [Augmented SBERT](https://arxiv.org/pdf/2010.08240.pdf) 📄

- [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) 📄

- [Compatibility-aware Heterogeneous Visual Search](https://arxiv.org/abs/2105.06047) 📷

- [Learning Personal Style from Few Examples](https://chuanenlin.com/personalstyle)📷

- [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979)📄

- [A Survey of Transformers](https://arxiv.org/abs/2106.04554)📄📷

- [SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://dl.acm.org/doi/10.1145/3404835.3463098)📄

- [High Quality Related Search Query Suggestions using Deep Reinforcement Learning](https://arxiv.org/abs/2108.04452v1)

- [Embedding-based Product Retrieval in Taobao Search](https://arxiv.org/pdf/2106.09297.pdf)📄📷

- [TPRM: A Topic-based Personalized Ranking Model for Web Search](https://arxiv.org/abs/2108.06014)📄

- [mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset](https://arxiv.org/abs/2108.13897)📄

- [Database Reasoning Over Text](https://aclanthology.org/2021.acl-long.241.pdf)📄

- [How Does Adversarial Fine-Tuning Benefit BERT?](https://arxiv.org/abs/2108.13602)📄

- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)📄

- [Primer: Searching for Efficient Transformers for Language Modeling](https://arxiv.org/abs/2109.08668)📄

- [How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings](https://arxiv.org/pdf/2109.10179.pdf)🔊

- [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821#)📄

- [Compositional Attention: Disentangling Search and Retrieval](https://arxiv.org/abs/2110.09419)📄📷

- [SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search](https://arxiv.org/abs/2111.08566)

- [GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval](https://arxiv.org/abs/2112.07577) 📄

- [Generative Search Engines: Initial Experiments](https://computationalcreativity.net/iccc21/wp-content/uploads/2021/09/ICCC_2021_paper_50.pdf) 📷

- [Rethinking Search: Making Domain Experts out of Dilettantes](https://dl.acm.org/doi/10.1145/3476415.3476428)

- [WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach](https://arxiv.org/abs/2104.01767)

### 2022

- [Text and Code Embeddings by Contrastive Pre-Training](https://arxiv.org/abs/2201.10005)📄

- [RELIC: Retrieving Evidence for Literary Claims](https://arxiv.org/abs/2203.10053)📄

- [Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations](https://arxiv.org/abs/2109.13059)📄

- [SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation](https://arxiv.org/abs/2205.08180)🔊

- [An Analysis of Fusion Functions for Hybrid Retrieval](https://arxiv.org/abs/2210.11934)📄

- [Out-of-distribution Detection with Deep Nearest Neighbors](https://arxiv.org/abs/2204.06507)

- [ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition](https://arxiv.org/abs/2210.13352)🔊

- [Analyzing Acoustic Word Embeddings From Pre-Trained Self-Supervised Speech Models](https://arxiv.org/pdf/2210.16043.pdf)🔊

- [Rethinking with Retrieval: Faithful Large Language Model Inference](https://arxiv.org/abs/2301.00303)📄

- [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/pdf/2212.10496.pdf)📄

- [Transformer Memory as a Differentiable Search Index](https://arxiv.org/abs/2202.06991)📄

### 2023

- [FINGER: Fast Inference for Graph-based Approximate Nearest Neighbor Search](https://dl.acm.org/doi/10.1145/3543507.3583318)📄

- [“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors](https://aclanthology.org/2023.findings-acl.426/)📄

- [SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval](https://dl.acm.org/doi/pdf/10.1145/3539618.3592065) 📄

## Articles

- [Tackling Semantic Search](https://adityamalte.substack.com/p/tackle-semantic-search/)

- [Semantic search in Azure Cognitive Search](https://docs.microsoft.com/en-us/azure/search/semantic-search-overview)

- [How we used semantic search to make our search 10x smarter](https://zilliz.com/blog/How-we-used-semantic-search-to-make-our-search-10-x-smarter/)

- [Stanford AI Blog: Building Scalable, Explainable, and Adaptive NLP Models with Retrieval](https://ai.stanford.edu/blog/retrieval-based-NLP/)

- [Building a semantic search engine with dual space word embeddings](https://m.mage.ai/building-a-semantic-search-engine-with-dual-space-word-embeddings-f5a596eb6d90)

- [Billion-scale semantic similarity search with FAISS+SBERT](https://towardsdatascience.com/billion-scale-semantic-similarity-search-with-faiss-sbert-c845614962e2)

- [Some observations about similarity search thresholds](https://greglandrum.github.io/rdkit-blog/similarity/reference/2021/05/26/similarity-threshold-observations1.html)

- [Near Duplicate Image Search using Locality Sensitive Hashing](https://keras.io/examples/vision/near_dup_search/)

- [Free Course on Vector Similarity Search and Faiss](https://link.medium.com/HtFoFKlKvkb)

- [Comprehensive Guide To Approximate Nearest Neighbors Algorithms](https://link.medium.com/V62Z8drvEkb)

- [Introducing the hybrid index to enable keyword-aware semantic search](https://www.pinecone.io/learn/hybrid-search/?utm_medium=email&_hsmi=0&_hsenc=p2ANqtz--zLu9hiyh-y_XTa7FCEpi8JESJKmif5dhpYtAxTWka8PIttaTOGE21LMZlg9EOZyPYpCm6GDvYy57tlGRwH6TjgLCsJg&utm_content=231741722&utm_source=hs_email)

- [Argilla Semantic Search](https://docs.argilla.io/en/latest/guides/features/semantic-search.html)

- [Co:here's Multilingual Text Understanding Model](https://txt.cohere.ai/multilingual/)

- [Simplify Search with Multilingual Embedding Models](https://blog.vespa.ai/simplify-search-with-multilingual-embeddings/)

## Libraries and Tools

- [fastText](https://fasttext.cc/)

- [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4)

- [SBERT](https://www.sbert.net/)

- [ELECTRA](https://github.com/google-research/electra)

- [LaBSE](https://tfhub.dev/google/LaBSE/2)

- [LASER](https://github.com/facebookresearch/LASER)

- [Relevance AI - Vector Platform From Experimentation To Deployment](https://relevance.ai)

- [Haystack](https://github.com/deepset-ai/haystack/)

- [Jina.AI](https://jina.ai/)

- [pinecone](https://www.pinecone.io/)

- [SentEval Toolkit](https://github.com/facebookresearch/SentEval?utm_source=catalyzex.com)

- [ranx](https://github.com/AmenRa/ranx)

- [BEIR: Benchmarking IR](https://github.com/UKPLab/beir)

- [RELiC: Retrieving Evidence for Literary Claims Dataset](https://relic.cs.umass.edu/)

- [matchzoo-py](https://github.com/NTMC-Community/MatchZoo-py)

- [deep_text_matching](https://github.com/wangle1218/deep_text_matching)

- [Which Frame?](http://whichframe.com/)

- [lexica.art](https://lexica.art/)

- [emoji semantic search](https://github.com/lilianweng/emoji-semantic-search)

- [PySerini](https://github.com/castorini/pyserini)

- [BERTSerini](https://github.com/rsvp-ai/bertserini)

- [BERTSimilarity](https://github.com/Brokenwind/BertSimilarity)

- [milvus](https://www.milvus.io/)

- [NeuroNLP++](https://plusplus.neuronlp.fruitflybrain.org/)

- [weaviate](https://github.com/semi-technologies/weaviate)

- [semantic-search-through-wikipedia-with-weaviate](https://github.com/semi-technologies/semantic-search-through-wikipedia-with-weaviate)

- [natural-language-youtube-search](https://github.com/haltakov/natural-language-youtube-search)

- [same.energy](https://www.same.energy/about)

- [ann benchmarks](http://ann-benchmarks.com/)

- [scaNN](https://github.com/google-research/google-research/tree/master/scann)

- [REALM](https://github.com/google-research/language/tree/master/language/realm)

- [annoy](https://github.com/spotify/annoy)

- [pynndescent](https://github.com/lmcinnes/pynndescent)

- [nsg](https://github.com/ZJULearning/nsg)

- [FALCONN](https://github.com/FALCONN-LIB/FALCONN)

- [redis HNSW](https://github.com/zhao-lang/redis_hnsw)

- [autofaiss](https://github.com/criteo/autofaiss)

- [DPR](https://github.com/facebookresearch/DPR)

- [rank_BM25](https://github.com/dorianbrown/rank_bm25)

- [FlashRank](https://github.com/PrithivirajDamodaran/FlashRank)

- [nearPy](http://pixelogik.github.io/NearPy/)

- [vearch](https://github.com/vearch/vearch)

- [vespa](https://github.com/vespa-engine/vespa)

- [pgANN](https://github.com/netrasys/pgANN)

- [Tensorflow Similarity](https://github.com/tensorflow/similarity)

- [opensemanticsearch.org](https://www.opensemanticsearch.org/)

- [GPT3 Semantic Search](https://gpt3demo.com/category/semantic-search)

- [searchy](https://github.com/lubianat/searchy)

- [txtai](https://github.com/neuml/txtai)

- [HyperTag](https://github.com/Ravn-Tech/HyperTag)

- [vectorai](https://github.com/vector-ai/vectorai)

- [embeddinghub](https://github.com/featureform/embeddinghub)

- [AquilaDb](https://github.com/Aquila-Network/AquilaDB)

- [STripNet](https://github.com/stephenleo/stripnet)

## Datasets

- [Semantic Text Similarity Dataset Hub](https://github.com/brmson/dataset-sts)

- [Facebook AI Image Similarity Challenge](https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/?fbclid=IwAR31vRV0EdxRdrxtPy12neZtBJQ0H9qdLHm8Wl2DjHY09PtQdn1nEEIJVUo)

- [WIT: Wikipedia-based Image Text Dataset](https://github.com/google-research-datasets/wit)

- [BEIR](https://github.com/beir-cellar/beir)

- MTEB

## Milestones

See the [project board](https://github.com/Agrover112/awesome-semantic-search/projects/1) for the task list if you would like to contribute to any of the open issues.