https://github.com/esteininger/vector-search

The definitive guide to using Vector Search to solve your semantic search production workload needs.

lucene nlp search-engine vector-search


## Vector Search

Vector search engines let developers `store` vectors in structures built for nearest-neighbor algorithms (e.g., KNN) and provide an engine to `compute` how similar vectors are (for example, by cosine distance) to determine which vectors are related.
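
As a rough illustration of the `store` and `compute` halves, here is a minimal sketch of a brute-force, in-memory store. The class name `TinyVectorStore`, its methods, and the sample vectors are all hypothetical and exist only for this example; production engines rely on approximate indexes rather than scanning every vector.

```python
import math


class TinyVectorStore:
    """Hypothetical in-memory store that answers KNN queries by brute force."""

    def __init__(self):
        self._items = {}  # item id -> vector (list of floats)

    def store(self, item_id, vector):
        """Keep a copy of the vector under the given id."""
        self._items[item_id] = list(vector)

    @staticmethod
    def cosine_similarity(a, b):
        """Cosine of the angle between a and b (0.0 if either is all zeros)."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    def search(self, query, k=3):
        """Return the k stored ids most similar to the query, best first."""
        scored = [
            (self.cosine_similarity(query, vector), item_id)
            for item_id, vector in self._items.items()
        ]
        scored.sort(reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]


# Made-up 3-dimensional vectors; real embeddings have hundreds of dimensions.
store = TinyVectorStore()
store.store("cat", [0.9, 0.1, 0.0])
store.store("kitten", [0.8, 0.2, 0.1])
store.store("car", [0.1, 0.9, 0.3])

results = store.search([1.0, 0.0, 0.0], k=2)
print(results)
```

With these sample vectors, the query points most nearly in the same direction as `cat`, then `kitten`, so those two ids come back first.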

This repository provides a comprehensive overview of the vector search landscape, including tutorials, guides, best practices, and extended learning. Please review the [Education](https://github.com/esteininger/vector-search#education) section to learn more.

Here is how you may use a Vector Search engine within your application search architecture:

## Topics

🧑‍🏫 **Foundations** - Learn the core concepts of vector-based information retrieval.

🎬 **Use Cases** - Understand where it makes sense to use vector search.

💵 **Architecture** - Guides on how to use vector search in your architecture.

## Foundations

| # | Label | Description |
|:--|:--------------------------------------------------------------------|:---|
| 1 | Keyword vs Vector Search | The difference between standard (TF-IDF) text search and vector search and when to use each. |
| 2 | [Sparse Vector Tutorial](/foundations/sparse-vector-tutorial) | A walkthrough of building your own sparse vector feature extraction engine. |
| 3 | Dense Vector Tutorial | A walkthrough of building your own dense vector feature extraction engine. |
| 4 | [Atlas Vector Search Engine](/foundations/atlas-vector-search) | Guides that showcase MongoDB Atlas' vector search implementation. |
| 5 | [Vector Search Comparisons](/foundations/vector-search-comparisons) | A comparison of the most popular vector search engines. |
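
To make the idea of a sparse vector concrete, here is a minimal TF-IDF sketch. It is not the implementation from the linked tutorial; the function names, the whitespace tokenizer, and the sample documents are assumptions made for this example. Each document becomes a mapping from term index to weight, and terms that do not occur in the document are simply absent, which is what makes the vector sparse.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_tfidf(docs):
    """Return (vocab, vectors); each vector is a sparse {term_index: weight} dict."""
    tokenized = [tokenize(doc) for doc in docs]
    terms = sorted({term for doc in tokenized for term in doc})
    vocab = {term: index for index, term in enumerate(terms)}
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    n_docs = len(docs)

    vectors = []
    for doc in tokenized:
        sparse = {}
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            weight = tf * idf
            if weight > 0:  # keep only non-zero entries
                sparse[vocab[term]] = weight
        vectors.append(sparse)
    return vocab, vectors


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "vector search finds similar items",
]
vocab, vectors = build_tfidf(docs)
print(len(vocab), "terms in the vocabulary")
print(vectors[0])
```

Dense vectors, by contrast, usually come from a trained model and assign a value to every dimension.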

## Use Cases

| # | Label | Description |
|:--|:------------------------------------------------------------|:-----------|
| 1 | Sentence Similarity | Determining how similar two texts are. |
| 2 | Token Classification | Assigning a pre-defined category to each token in a text. |
| 3 | [Question and Answering](/use-cases/question-and-answering) | Building systems that automatically answer questions. |
| 4 | Personalization | Using client data to personalize query results. |
| 5 | Automated Synonym Creation | Automatically enriching a collection of synonyms. |
| 6 | Summarization | Condensing a corpus into fewer words. |
| 7 | Conversational | Dialogue response generation. |
| 8 | [File Search](https://mixpeek.com/) | Searching the contents of files across multiple modalities. |

## Architecture

[One-click model deployment that never leaves your AWS account](https://climb.dev)

| # | Source | Description |
|:--|:-----------------------|:------------------------------------------------|
| 1 | [Reference Architecture](https://esteininger.medium.com/vertical-integration-is-key-to-winning-the-ai-race-44c8e4bd3b30) | Common best practices for deploying vector search architecture in production. |
| 2 | Model Hosting | Suggestions on how to host your vector models. |
| 3 | Model Versioning | Common best practices for versioning your models as they evolve. |
| 4 | [Feedback Loops](https://learn.mixpeek.com/your-search-bar-is-a-product-roadmap-goldmine/) | Query re-ranking, learning to rank, and more. |
| 5 | Selecting Models | Which model supports your domain-specific tasks best? |

## Education

Although vector search is a challenging topic to grasp, a myriad of educational resources is at your disposal.

### Information Retrieval

The overarching field of study.

- [A Primer on Neural Network Models for Natural Language Processing](https://u.cs.biu.ac.il/~yogo/nnlp.pdf)
- [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)
- [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/)

### Transformer Architecture

The latest breakthrough in converting human content (text, images, etc.) into vector representations.
Transformers are deep learning models that use "self-attention" to weigh the significance of each part of the input data differently.

- [Transformers by Pinecone](https://www.pinecone.io/learn/transformers/)
- [Transformers by Huggingface](https://aclanthology.org/2020.emnlp-demos.6.pdf)
- [Attention Is All You Need Research Paper](https://arxiv.org/pdf/1706.03762.pdf)
- [BERT Research Paper](https://arxiv.org/pdf/1810.04805.pdf)
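
The self-attention idea described above can be sketched in a few lines. This is a deliberately simplified version: it uses the token vectors directly as queries, keys, and values, whereas real Transformers apply learned projection matrices and several attention heads. The function names and the sample token vectors are assumptions for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens.

    Simplification: no learned projections and a single head.
    Returns (outputs, weights), with one row per input token.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Score the query against every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        attention = softmax(scores)
        weights.append(attention)
        # The output is the attention-weighted average of all token vectors.
        outputs.append(
            [sum(w * token[i] for w, token in zip(attention, tokens)) for i in range(dim)]
        )
    return outputs, weights


# Three made-up 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how strongly one token attends to every token in the input. Here the first token weighs the second, which points in a similar direction, more heavily than the third.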

### Similarity Search

To determine what is relevant, computers need to measure the distance between points, which in this case are vectors.

- [Lucene KNN MVP](https://issues.apache.org/jira/browse/LUCENE-9004) & [Follow-Up](https://issues.apache.org/jira/browse/LUCENE-10054)
- [Google Vector Search](https://cloud.google.com/blog/topics/developers-practitioners/find-anything-blazingly-fast-googles-vector-search-technology)
- [HNSW Graphs Research Paper](https://arxiv.org/abs/1603.09320)
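
The choice of distance measure changes which vectors count as "close". The sketch below compares Euclidean distance with cosine distance on made-up 2-dimensional vectors; the function and variable names are assumptions for this example.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus cosine similarity; ignores magnitude (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 1.0]
same_direction = [5.0, 5.0]  # same direction as the query, larger magnitude
nearby = [1.0, 0.0]          # close in space, different direction

for name, candidate in [("same_direction", same_direction), ("nearby", nearby)]:
    print(
        name,
        round(euclidean_distance(query, candidate), 3),
        round(cosine_distance(query, candidate), 3),
    )
```

Euclidean distance ranks `nearby` as the closer vector, while cosine distance ranks `same_direction` as closer because it compares direction only. Which behavior is appropriate depends on the model that produced the vectors.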

## Gratitude

This repository wouldn't be possible without several key individuals:

- [Nick Gogan](https://github.com/nickgogan)
- [Marcus Eagan](https://github.com/MarcusSorealheis)

## Watch for Changes

This is a living repository and will evolve as I learn and the landscape changes. Please subscribe to changes accordingly via: